Abstract

A rigorous methodology is proposed to study cell division data consisting of several observed
genealogical trees of possibly different shapes. The procedure takes into account missing
observations, data from different trees, and the dependence structure within genealogical
trees. Its main new feature is the joint use of all available information from several data sets,
instead of single-data-set estimation, to avoid the drawbacks of low accuracy for estimators or
low power for tests on small single trees. The data are modeled by an asymmetric bifurcating
autoregressive process, and possibly missing observations are taken into account by modeling
the genealogies with a two-type Galton-Watson process. Least-squares estimators of the
unknown parameters of the processes are given and symmetry tests are derived. Results are
applied to real data of Escherichia coli division, and an empirical study of the convergence
rates of the estimators and of the power of the tests is conducted on simulated data.



1 Introduction 



Cell lineage data consist of observations of some quantitative characteristic of the cells (e.g. their 
length, growth rate, time until division, . . . ) over several generations descended from an initial cell. 
Track is kept of the genealogy to study the inherited effects on the evolution of the characteristic. As 
a cell usually gives birth to two offspring by division, such genealogies are structured as binary trees. 



Cowan and Staudte (1986) first adapted autoregressive processes to this binary tree structure by
introducing bifurcating autoregressive processes (BAR). This parametric model takes into account
both the environmental and inherited effects. Inference on this model has been proposed based
on either a single-tree growing to infinity, see e.g. Cowan and Staudte (1986), Huggins (1996),
Huggins and Basawa (2000), Zhou and Basawa (2005), or for an asymptotically infinite number of
small replicated trees, see e.g. Huggins and Staudte (1994), Huggins and Basawa (1999).



More recently, studies of aging in single cell organisms by Stewart et al (2005) suggested that
cell division may not be symmetric. An asymmetric BAR model was therefore proposed by Guyon
(2007), where the two sets of parameters corresponding to sister cells are allowed to be different.
Inference for this model was only investigated for single-trees growing to infinity, see Guyon (2007),
Bercu et al (2009) for the fully observed model or Delmas and Marsalle (2010), de Saporta et al
(2011), de Saporta et al (2012) for missing data models.



Cell division data often consist of recordings over several genealogies of cells evolving in similar
experimental conditions. For instance, Stewart et al (2005) filmed 94 colonies of Escherichia coli
cells dividing between four and nine times. We therefore propose a new rigorous approach to take
into account all the available information. Indeed, we propose an inference based on a finite fixed
number of replicated trees when the total number of observed cells tends to infinity. We use the
missing data asymmetric BAR model introduced by de Saporta et al (2011). In this approach,
the observed genealogies are modeled with a two-type Galton-Watson (GW) process. However,
we propose a different least-squares estimator for the parameters of the BAR process that does
not correspond to the single-tree estimators averaged over the replicated trees. We also propose an
estimator of the parameters of the GW process specific to our binary tree structure and not based
simply on the observation of the number of cells of each type in each generation as in Guttorp
(1991), Maaouia and Touati (2005). We study the consistency and asymptotic normality of our
estimators and derive asymptotic confidence intervals as well as Wald-type tests to investigate
the asymmetry of the data for both the BAR and GW processes. Our results are applied to the
Escherichia coli data of Stewart et al (2005). We also provide an empirical study of the convergence
rate of our estimators and of the power of the symmetry tests on simulated data.

The paper is organized as follows. In Section 2 we describe a methodology for least-squares
estimation based on multiple data sets in a general framework. In Section 3 we present the BAR
and observation models. In Section 4 we give our estimators and state their asymptotic properties.
In Section 5 we propose a new investigation of the data of Stewart et al (2005). In Section 6 we give
simulation results. The precise statement of the convergence results, the explicit form of the
asymptotic variance of the estimators and the convergence proofs are postponed to the appendix.



2 Methodology 

We work with the following general framework. Consider that several data sets are available, 
obtained in similar experimental conditions and then assumed to come from the same parametric 
model. Suppose that there exists a consistent least-squares estimator for the parametric model. 
This estimator can be computed on each individual data set, but we would like to take into account
all the available data, which should improve the accuracy of the estimation.

To this aim, we assume that the different data sets are independent realizations of the parametric 
model. A natural idea is to average the single-set estimators. It may be a good approach if the 
single-set estimators have roughly the same variance, which is usually the case when the data sets 
have the same size. However, if the data sets have very different sizes, the single-set estimators 
may have variances of different orders and this direct approach becomes dubious. 

Instead, we propose to use a global least-squares estimator. Suppose that we have m data sets.
Let $\theta$ be the (possibly multivariate) parameter to be estimated, and $\hat\theta_{j,n}$ the
least-squares estimator built with the j-th data set for $1 \le j \le m$. The global least-squares
estimator $\hat\theta_n$ decomposes as

$$\hat\theta_n = \Big(\sum_{j=1}^{m} \boldsymbol{\Sigma}_{j,n}\Big)^{-1} \sum_{j=1}^{m} V_{j,n},$$

where $\boldsymbol{\Sigma}_{j,n}$ is a normalizing matrix and $V_{j,n}$ a vector of the same size as
$\theta$, involved in the decomposition of the single-set least-squares estimator $\hat\theta_{j,n}$ as follows

$$\hat\theta_{j,n} = \boldsymbol{\Sigma}_{j,n}^{-1} V_{j,n}.$$

Note that the estimator $\hat\theta_n$ thus constructed is neither an average nor a function of the
single-set estimators $\hat\theta_{j,n}$. Hence, the asymptotic behavior of the global estimator
$\hat\theta_n$ cannot be deduced from that of the single-set estimators $\hat\theta_{j,n}$.
Nevertheless, the asymptotic behavior of $\hat\theta_{j,n}$ is often obtained through the convergence
of the normalizing matrices $\boldsymbol{\Sigma}_{j,n}$ and of the vectors $V_{j,n}$ separately, which
gives the convergence of the global estimator $\hat\theta_n$ as the number m of data sets is fixed.
Note that the asymptotics are not taken in the number m of data sets, which stays fixed, but in the
number of observations.

The aim of this paper is to apply this methodology to cell division data with missing data.
In this special case, the convergence of the global estimator $\hat\theta_n$ is not straightforward,
because we have to prove it on a set where the convergence of each $\boldsymbol{\Sigma}_{j,n}$ and
$V_{j,n}$ is not ensured.
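
The difference between pooling and averaging can be illustrated on a toy linear regression. The
sketch below is only an illustration under assumed settings (a simple model y = a + b x, Gaussian
noise, hypothetical sample sizes and function names); it is not the BAR estimator itself. It builds
the normal-equation parts of each data set, sums them before solving, and compares the result with
the average of the single-set estimators.

```python
import numpy as np

def normal_equation_parts(x, y):
    """Return (S, V) such that the single-set LS estimator solves S @ theta = V."""
    design = np.column_stack([np.ones_like(x), x])  # rows (1, x_k)
    return design.T @ design, design.T @ y

def global_ls(datasets):
    """Pooled estimator: (sum_j S_j)^{-1} sum_j V_j."""
    parts = [normal_equation_parts(x, y) for x, y in datasets]
    S = sum(p[0] for p in parts)
    V = sum(p[1] for p in parts)
    return np.linalg.solve(S, V)

def averaged_ls(datasets):
    """Plain average of the single-set estimators S_j^{-1} V_j."""
    parts = [normal_equation_parts(x, y) for x, y in datasets]
    return np.mean([np.linalg.solve(S, V) for S, V in parts], axis=0)

# Hypothetical example: three data sets of very different sizes.
rng = np.random.default_rng(0)
true_theta = np.array([0.02, 0.47])
datasets = []
for size in (5, 50, 5000):
    x = rng.normal(0.04, 0.005, size)
    y = true_theta[0] + true_theta[1] * x + rng.normal(0.0, 0.004, size)
    datasets.append((x, y))

theta_global = global_ls(datasets)
theta_avg = averaged_ls(datasets)
print("global  :", theta_global)
print("averaged:", theta_avg)
```

By construction the pooled estimator coincides with the least-squares fit on the concatenated data,
so small data sets contribute in proportion to their size instead of receiving the same weight as
large ones.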

3 Model



Our aim is to estimate the parameters of coupled BAR and GW processes through m i.i.d.
realizations of the processes. We first define our parametric model and introduce our notation. The
BAR and GW processes have the same dynamics as in de Saporta et al (2011); the main difference
is that our inference is here based on several i.i.d. realizations of the processes, instead of a single
one. Additional notation together with the precise technical assumptions are specified in the appendix.

3.1 Bifurcating autoregressive model 

Consider m i.i.d. replications of the asymmetric BAR process with coefficient
$\theta = (a_0, b_0, a_1, b_1) \in \mathbb{R}^4$. More precisely, for $1 \le j \le m$, the first cell
in genealogy j is labelled (j, 1) and for $k \ge 1$, the two offspring of cell (j, k) are labelled
(j, 2k) and (j, 2k+1). As we consider an asymmetric model, each cell has a type defined by its label:
(j, 2k) has type even and (j, 2k+1) has type odd. The characteristic of cell k in genealogy j is
denoted by $X_{(j,k)}$. The BAR processes are defined recursively as follows: for all $1 \le j \le m$
and $k \ge 1$, one has

$$\begin{cases}
X_{(j,2k)} = a_0 + b_0 X_{(j,k)} + \varepsilon_{(j,2k)}, \\
X_{(j,2k+1)} = a_1 + b_1 X_{(j,k)} + \varepsilon_{(j,2k+1)}.
\end{cases} \qquad (1)$$

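Let us also define the variance and covariance of the noise sequence

$$\sigma_0^2 = \mathbb{E}[\varepsilon_{(j,2k)}^2], \qquad \sigma_1^2 = \mathbb{E}[\varepsilon_{(j,2k+1)}^2], \qquad \rho = \mathbb{E}[\varepsilon_{(j,2k)}\,\varepsilon_{(j,2k+1)}].$$

Our goal is to estimate the parameters $\theta = (a_0, b_0, a_1, b_1)$ and $(\sigma_0^2, \sigma_1^2, \rho)$,
and then test whether $(a_0, b_0) = (a_1, b_1)$ or not.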
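
The recursion (1) can be simulated directly on a complete binary tree. The following is a minimal
sketch for a single genealogy; the Gaussian noise, the root value and the function name
`simulate_bar` are assumptions made for illustration, while the numerical values are those used for
the asymmetric simulated sets later in the paper.

```python
import numpy as np

def simulate_bar(n_gen, theta, noise_cov, x_root, rng):
    """Simulate one fully observed BAR genealogy up to generation n_gen.

    Cell k is stored at index k (index 0 is unused), so the daughters of
    cell k sit at indices 2k (type even) and 2k + 1 (type odd).
    """
    a0, b0, a1, b1 = theta
    n_cells = 2 ** (n_gen + 1) - 1
    n_mothers = 2 ** n_gen - 1
    x = np.full(n_cells + 1, np.nan)
    x[1] = x_root
    # One correlated noise pair per mother (Gaussian noise is an assumption).
    eps = rng.multivariate_normal([0.0, 0.0], noise_cov, size=n_mothers + 1)
    for k in range(1, n_mothers + 1):
        x[2 * k] = a0 + b0 * x[k] + eps[k, 0]
        x[2 * k + 1] = a1 + b1 * x[k] + eps[k, 1]
    return x

theta = (0.0203, 0.4615, 0.0195, 0.4782)
noise_cov = np.array([[2.28e-5, 0.48e-5],
                      [0.48e-5, 1.34e-5]])  # [[sigma_0^2, rho], [rho, sigma_1^2]]
rng = np.random.default_rng(0)
x = simulate_bar(9, theta, noise_cov, x_root=0.037, rng=rng)
print("mean of even cells:", x[2::2].mean())
print("mean of odd cells :", x[3::2].mean())
```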



3.2 Observation process 

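We now turn to the observation process $(\delta_{(j,k)})$ that encodes the presence or absence of
cell measurements in the available data:

$$\delta_{(j,k)} = \begin{cases}
1 & \text{if cell } k \text{ in genealogy } j \text{ is observed,} \\
0 & \text{if cell } k \text{ in genealogy } j \text{ is not observed.}
\end{cases}$$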



To take into account possible asymmetry in the observation process, we use a two-type
Galton-Watson model. The relevance of this model to E. coli data is discussed in Section 5. Again, we
suppose all the m observation processes to be drawn independently from the same two-type GW
process. More precisely, for all $1 \le j \le m$, we model the observation process
$(\delta_{(j,k)})_{k \ge 1}$ for the j-th genealogy as follows. We set $\delta_{(j,1)} = 1$ and draw
the pairs $(\delta_{(j,2k)}, \delta_{(j,2k+1)})$ independently from one another with a law depending
on the type of cell k. More precisely, for $i \in \{0, 1\}$, if k is of type i we set

$$\mathbb{P}\big((\delta_{(j,2k)}, \delta_{(j,2k+1)}) = (l_0, l_1) \,\big|\, \delta_{(j,k)} = 1\big) = p^{(i)}(l_0, l_1),$$

$$\mathbb{P}\big((\delta_{(j,2k)}, \delta_{(j,2k+1)}) = (0, 0) \,\big|\, \delta_{(j,k)} = 0\big) = 1,$$

for all $(l_0, l_1) \in \{0, 1\}^2$. Thus, $p^{(i)}(l_0, l_1)$ is the probability that a cell of type i
has $l_0$ daughter of type 0 and $l_1$ daughter of type 1. If a cell is missing, its descendants are
missing as well. Figure 1 gives an example of a realization of an observation process. We also assume
that the observation processes are independent of the BAR processes.
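
A short simulation makes the observation mechanism concrete. The sketch below draws one observation
process up to a given generation; the function name, the array layout (cell k at index k) and the
ordering (1,1), (1,0), (0,1), (0,0) of the reproduction probabilities are conventions chosen here
for illustration.

```python
import numpy as np

# Ordering of the pairs (l0, l1) used for the probability vectors below.
PAIRS = [(1, 1), (1, 0), (0, 1), (0, 0)]

def simulate_observation(n_gen, p_even, p_odd, rng):
    """Draw delta for one genealogy up to generation n_gen.

    p_even / p_odd are the reproduction laws of mothers of type even (0)
    and odd (1). Index 0 is unused; delta[k] is 1 when cell k is observed.
    """
    n_cells = 2 ** (n_gen + 1) - 1
    delta = np.zeros(n_cells + 1, dtype=int)
    delta[1] = 1  # the ancestor is always observed
    for k in range(1, 2 ** n_gen):
        if delta[k] == 0:
            continue  # a missing cell has only missing descendants
        probs = p_even if k % 2 == 0 else p_odd
        l0, l1 = PAIRS[rng.choice(4, p=probs)]
        delta[2 * k], delta[2 * k + 1] = l0, l1
    return delta

rng = np.random.default_rng(0)
law = [0.50, 0.04, 0.04, 0.42]  # symmetric law, same for both types
delta = simulate_observation(9, law, law, rng)
print("observed cells:", delta.sum())
```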



4 Inference 

Our first goal is to estimate the reproduction probabilities $p^{(i)}(l_0, l_1)$ of the GW process
from the m genealogies of observed cells up to the n-th generation to be able to test the symmetry of
the GW model itself. Our second goal is to estimate $\theta = (a_0, b_0, a_1, b_1)^t$ from all the
observed individuals of the m trees up to the n-th generation. We then give the asymptotic properties
of our estimators to be able to build confidence intervals and symmetry tests for $\theta$.

Figure 1: A tree of observed cells.

Denote by $|\mathbb{T}_n^*|$ the total number of observed cells in the m trees up to the n-th
generation of offspring from the original ancestors, and let

$$\mathcal{E} = \Big\{ \lim_{n \to \infty} |\mathbb{T}_n^*| = \infty \Big\}$$

be the non-extinction set, on which the global cell population grows to infinity. Our asymptotic
results only hold on the set $\mathcal{E}$. This global non-extinction set is the union and not the
intersection of the non-extinction sets of each single-tree. It means that some trees may become
extinct, which allows us to take into account trees with a different number of observed generations.
We are thus in a case where averaging single-tree estimators is not recommended. The possibility of
extinction for some trees is also the reason why the convergence of the multiple-trees estimator
$\hat\theta_n$ is not straightforward from existing results in the literature. Conditions for the
probability of non-extinction to be positive are given in A.1.



4.1 Estimation of the reproduction law of the GW process 



There are many references on the inference of a multi-type GW process, see for instance Guttorp
(1991) and Maaouia and Touati (2005). Our context of estimation is very specific because the
information given by $(\delta_{(j,k)})$ is more precise than that given by the number of cells of each
type in a given generation that is usually used in the literature. Indeed, not only do we know the
number of cells of each type in each generation, but we also know their precise positions on the
binary tree of cell division. The empirical estimators of the reproduction probabilities using data
up to the n-th generation are then, for $i$, $l_0$, $l_1$ in $\{0, 1\}$,

$$\hat p_n^{(i)}(l_0, l_1) = \frac{\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+i)}\, \phi_{l_0}\big(\delta_{(j,2(2k+i))}\big)\, \phi_{l_1}\big(\delta_{(j,2(2k+i)+1)}\big)}{\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+i)}},$$

where $\phi_0(x) = 1 - x$ and $\phi_1(x) = x$, if the denominator is non zero, the estimator equalling
zero otherwise. Note that the numerator is just the number of cells of type i in all the trees up to
generation n - 1 that have exactly $l_0$ daughter of type 0 and $l_1$ daughter of type 1 among the
cells observed up to the n-th generation. The denominator is the total number of cells of type i in
all the trees up to generation n - 1. Set also

$$p^{(i)} = \big(p^{(i)}(1,1),\, p^{(i)}(1,0),\, p^{(i)}(0,1),\, p^{(i)}(0,0)\big)^t,$$

the vector of the 4 reproduction probabilities for a mother of type i, $p = ((p^{(0)})^t, (p^{(1)})^t)^t$
the vector of all 8 reproduction probabilities and $\hat p_n$ its empirical estimator.
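
The counting interpretation of the estimator translates into a few lines of code. The sketch below
is an illustration, not the authors' implementation: it assumes each genealogy is stored as a 0/1
array with cell k at index k, uses the ordering (1,1), (1,0), (0,1), (0,0) of the vector $p^{(i)}$,
and skips the ancestor (cell 1), in line with a sum over cells of the form 2k + i with k at least 1.

```python
import numpy as np

PAIRS = [(1, 1), (1, 0), (0, 1), (0, 0)]  # ordering of (l0, l1)

def estimate_reproduction_law(deltas, cell_type):
    """Empirical reproduction law of mothers of the given type (0 even, 1 odd).

    deltas is a list of observation arrays, one per genealogy, where
    delta[k] is 1 when cell k is observed and index 0 is unused.
    """
    counts = np.zeros(4)
    for delta in deltas:
        last_mother = (len(delta) - 1) // 2  # last label whose daughters are stored
        for c in range(2, last_mother + 1):  # cells 2k + i with k >= 1
            if c % 2 != cell_type or delta[c] == 0:
                continue
            pair = (int(delta[2 * c]), int(delta[2 * c + 1]))
            counts[PAIRS.index(pair)] += 1
    total = counts.sum()
    # The estimator is set to zero when no mother of this type is observed.
    return counts / total if total > 0 else counts

# Tiny hand-made genealogy with two generations of offspring (cells 1 to 7).
delta = np.array([0, 1, 1, 1, 1, 0, 0, 0])
print(estimate_reproduction_law([delta], 0))  # cell 2 has daughters (1, 0)
print(estimate_reproduction_law([delta], 1))  # cell 3 has daughters (0, 0)
```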

4.2 Least-squares estimation for the BAR parameters

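For the parameters of the BAR process, we use the standard least-squares (LS) estimator
$\hat\theta_n$ with all the available data from the m trees up to generation n. It minimizes

$$\Delta_n(\theta) = \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-1}} \delta_{(j,2k)} \big(X_{(j,2k)} - a_0 - b_0 X_{(j,k)}\big)^2 + \delta_{(j,2k+1)} \big(X_{(j,2k+1)} - a_1 - b_1 X_{(j,k)}\big)^2.$$

Consequently, for all $n \ge 1$ we have $\hat\theta_n = (\hat a_{0,n}, \hat b_{0,n}, \hat a_{1,n}, \hat b_{1,n})^t$ with

$$\hat\theta_n = \boldsymbol{\Sigma}_{n-1}^{-1} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-1}} \big(\delta_{(j,2k)} X_{(j,2k)},\; \delta_{(j,2k)} X_{(j,k)} X_{(j,2k)},\; \delta_{(j,2k+1)} X_{(j,2k+1)},\; \delta_{(j,2k+1)} X_{(j,k)} X_{(j,2k+1)}\big)^t, \qquad (2)$$

where, for $i \in \{0, 1\}$, we defined

$$\boldsymbol{\Sigma}_n = \begin{pmatrix} S_n^0 & 0 \\ 0 & S_n^1 \end{pmatrix}, \qquad S_n^i = \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} \delta_{(j,2k+i)} \begin{pmatrix} 1 & X_{(j,k)} \\ X_{(j,k)} & X_{(j,k)}^2 \end{pmatrix}.$$

Note that in the normalizing matrices the sum is over all observed cells for which a daughter
of type i is observed, and not merely over all observed cells. To estimate the variance parameters
$\sigma_i^2$ and $\rho$, we define the empirical residuals. For all $2^\ell \le k \le 2^{\ell+1} - 1$
and $1 \le j \le m$ set

$$\begin{cases}
\hat\varepsilon_{(j,2k)} = \delta_{(j,2k)} \big(X_{(j,2k)} - \hat a_{0,\ell} - \hat b_{0,\ell} X_{(j,k)}\big), \\
\hat\varepsilon_{(j,2k+1)} = \delta_{(j,2k+1)} \big(X_{(j,2k+1)} - \hat a_{1,\ell} - \hat b_{1,\ell} X_{(j,k)}\big).
\end{cases}$$

We propose the following empirical estimators

$$\hat\sigma_{i,n}^2 = \frac{1}{|\mathbb{T}_{n-1}^{*i}|} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-1}} \hat\varepsilon_{(j,2k+i)}^2, \qquad \hat\rho_n = \frac{1}{|\mathbb{T}_{n-1}^{*01}|} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-1}} \hat\varepsilon_{(j,2k)}\, \hat\varepsilon_{(j,2k+1)},$$

where $\mathbb{T}_n^{*i}$ is the set of all cells which have at least one offspring of type i, for
$i \in \{0, 1\}$, and $\mathbb{T}_n^{*01}$ is the set of all the cells which have exactly two
offspring, in the m trees up to generation n.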
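
Because the criterion separates into an even part and an odd part, the LS estimator can be computed
as two pooled 2x2 regressions. The sketch below is a minimal illustration under assumed conventions
(arrays indexed by cell label with index 0 unused, hypothetical function name); it accumulates the
normalizing matrix and the right-hand side over all trees before solving, rather than averaging
per-tree estimators.

```python
import numpy as np

def bar_least_squares(xs, deltas):
    """Pooled LS estimate (a0, b0, a1, b1) from several genealogies.

    xs[j][k] is the characteristic of cell k in tree j and deltas[j][k]
    is 1 when that cell is observed. An observed daughter is assumed to
    have an observed mother, as in the observation model.
    """
    theta = np.empty(4)
    for i in (0, 1):  # i = 0: even daughters, i = 1: odd daughters
        S = np.zeros((2, 2))
        V = np.zeros(2)
        for x, delta in zip(xs, deltas):
            n_mothers = (len(x) - 1) // 2
            k = np.arange(1, n_mothers + 1)
            obs = delta[2 * k + i] == 1      # mothers with an observed type-i daughter
            x_mother = x[k[obs]]
            x_daughter = x[2 * k[obs] + i]
            S += np.array([[obs.sum(), x_mother.sum()],
                           [x_mother.sum(), (x_mother ** 2).sum()]])
            V += np.array([x_daughter.sum(), (x_mother * x_daughter).sum()])
        theta[2 * i:2 * i + 2] = np.linalg.solve(S, V)
    return theta

# Noise-free toy genealogy: the estimator should return theta exactly.
theta = np.array([0.0203, 0.4615, 0.0195, 0.4782])
x = np.full(32, np.nan)
x[1] = 1.0
for k in range(1, 16):
    x[2 * k] = theta[0] + theta[1] * x[k]
    x[2 * k + 1] = theta[2] + theta[3] * x[k]
delta = np.ones(32, dtype=int)
delta[0] = 0
print(bar_least_squares([x], [delta]))
```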

4.3 Consistency and normality 

We now state the convergence results we obtain for the estimators above. The assumptions (H.1)
to (H.6) are given in A.2. These results hold on the non-extinction set $\mathcal{E}$.

Theorem 4.1 Under assumptions (H.5-6) and for all $i$, $l_0$ and $l_1$ in $\{0, 1\}$,
$\hat p_n^{(i)}(l_0, l_1)$ converges to $p^{(i)}(l_0, l_1)$ almost surely on $\mathcal{E}$. Under
assumptions (H.0-6), $\hat\theta_n$, $\hat\sigma_{i,n}^2$ and $\hat\rho_n$ converge to $\theta$,
$\sigma_i^2$ and $\rho$ respectively, almost surely on $\mathcal{E}$.

The asymptotic normality results are only valid conditionally to the non-extinction of the global 
cell population. 

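Theorem 4.2 Under assumptions (H.5-6) we have

$$\sqrt{|\mathbb{T}_{n-1}^*|}\; (\hat p_n - p) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}(0, V),$$

and under assumptions (H.0-6), we have

$$\sqrt{|\mathbb{T}_{n-1}^*|}\; (\hat\theta_n - \theta) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}(0, \Gamma_\theta), \qquad \sqrt{|\mathbb{T}_{n-1}^{*01}|}\; (\hat\rho_n - \rho) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}(0, \gamma_\rho),$$

as well as the corresponding central limit result for the variance estimators $\hat\sigma_{i,n}^2$,
with asymptotic variance $\Gamma_\sigma$, conditionally to $\mathcal{E}$. The explicit forms of the
variance matrices $V$, $\Gamma_\theta$, $\Gamma_\sigma$ and of $\gamma_\rho$ are given in the
appendix, Eq. (10) for the last one.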

The proofs of these results are given in B.2 and B.3 for the GW process and in C.2 and C.3 for
the BAR process. From the asymptotic normality, one can naturally construct confidence intervals
and tests. Their explicit formulas are given in B.4 and C.4.
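
To fix ideas on how a test follows from asymptotic normality, here is a generic Wald construction
for the equality of the two fixed points, obtained with the delta method. It is a sketch under
assumptions and not the explicit formula of the appendix: the covariance matrix of $\hat\theta_n$ is
taken as a given input (the diagonal matrix below is purely hypothetical), and the function name is
chosen for illustration.

```python
import math

def wald_fixed_point_test(theta_hat, cov_theta_hat):
    """Generic delta-method Wald test of H0: a0/(1-b0) = a1/(1-b1).

    cov_theta_hat is an estimate of the covariance matrix of theta_hat
    (a hypothetical input here). Returns the statistic and its p-value
    under the chi-square distribution with one degree of freedom.
    """
    a0, b0, a1, b1 = theta_hat
    g = a0 / (1 - b0) - a1 / (1 - b1)
    # Gradient of g with respect to (a0, b0, a1, b1).
    grad = [1 / (1 - b0), a0 / (1 - b0) ** 2, -1 / (1 - b1), -a1 / (1 - b1) ** 2]
    var_g = sum(grad[r] * cov_theta_hat[r][c] * grad[c]
                for r in range(4) for c in range(4))
    stat = g * g / var_g
    # For one degree of freedom, P(chi2 > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical covariance, only to show the call signature.
cov = [[1e-8 if r == c else 0.0 for c in range(4)] for r in range(4)]
print(wald_fixed_point_test((0.0203, 0.4615, 0.0195, 0.4782), cov))
```

The test rejects symmetry at level 5% when the p-value is below 0.05, which corresponds to a
statistic above roughly 3.84.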

5 Data analysis



We applied our procedure to the Escherichia coli data of Stewart et al (2005). The biological issue
addressed is aging in single cell organisms. E. coli is a rod-shaped bacterium that reproduces by
dividing in the middle. Each cell thus has a new pole (due to the division of its mother) and an
old one (one of the two poles of its mother). The cell that inherits the old pole of its mother is
called the old pole cell, the other one is called the new pole cell. Therefore, each cell has a type,
old pole or new pole cell, inducing asymmetry in the cell division. On a binary tree, the new pole
cells are labelled by an even number and the old pole cells by an odd number.



Stewart et al (2005) filmed 94 colonies of dividing E. coli cells, determining the complete lineage
and the growth rate of each cell. The number of divisions goes from four to nine. The 94 data
sets gather $|\mathbb{T}_9^*| = 22394$ data (11189 of type even and 11205 of type odd). Not a single
data tree is complete. Missing data are mostly not due to cell death (only 16 cells are recorded to
die) but to measurement difficulties, caused mainly by overlapping cells or cells wandering away from
the field of view. Note also that for a growth rate to be recorded, the cell needs to be observed
through its whole life cycle. If this is not the case, there is no record at all, so that a censored
data model is not relevant. The observed average growth rate of even (resp. odd) cells is 0.0371
(resp. 0.0369). These data were investigated in Stewart et al (2005), Guyon et al (2005), Guyon
(2007), de Saporta et al (2011) and de Saporta et al (2012).



Stewart et al (2005) proposed a statistical study of the averaged genealogy and pairwise
comparison of sister cells. They concluded that the old pole cells exhibit cumulatively slowed growth,
less offspring biomass production and an increased probability of death, whereas single-experiment
analyses did not lead to this conclusion. However they assumed independence between the averaged
couples of sister cells, which does not hold in such genealogies.

The other studies are based on single-tree analyses instead of averaging all the genealogical
trees. Guyon et al (2005) model the growth rate by a Markovian bifurcating process, but their
procedure does not take into account the dependence between pairs of sister cells either. The
asymmetry was rejected (p-value < 0.1) in half of the experiments so that a global conclusion
was difficult. Guyon (2007) then investigated the asymptotic properties of a more general
asymmetric Markovian bifurcating autoregressive process, and rigorously constructed a Wald-type
test to study the asymmetry of the process. However, his model does not take into account
the possibly missing data from the genealogies. The author investigates the method on the 94
data sets but it is not clear how he manages missing data. More recently, de Saporta et al (2011)
proposed a single-tree analysis with a rigorous method to deal with the missing data and carried
out their analysis on the largest data set, concluding to asymmetry on this single set. Further
single-tree studies of the 51 data sets issued from the 94 colonies containing at least 8 generations
were conducted in de Saporta et al (2012). The symmetry hypothesis is rejected in one set out
of four for $(a_0, b_0) = (a_1, b_1)$ and one out of eight for $a_0/(1 - b_0) = a_1/(1 - b_1)$,
forbidding a global conclusion. Simulation studies suggest that the power of the tests on
single-trees is quite low for only eight or nine generations. This is what motivated the present
study and urged us to use all the data available in one global estimation, rather than single-tree
analyses.

In this section, we propose a new investigation of the E. coli data of Stewart et al (2005) where,
for the first time, the dependence structure between cells within a genealogy is fully taken into
account, missing data are taken care of rigorously, all the available data, i.e. the 94 sets, are
analyzed at once and both the growth rate and the number/type of descendants are investigated. It is
sensible to consider that all the data sets correspond to BAR processes with the same coefficients as
the experiments were conducted in similar conditions. Moreover, a direct comparison of single-tree
estimations would be meaningless as the data trees do not all have the same number of generations,
and it would be impossible to determine whether variations in the computed single-tree estimators
come from an intrinsic variability between trees or just the low accuracy of the estimators for
small trees. The original estimation procedure described in Section 2 enables us to use all the
information available without the drawbacks of low accuracy for estimators or low power for tests
on small single-trees.

5.1 Symmetry of the BAR process



We now give the results of our new investigation of the E. coli growth rate data of Stewart et al
(2005). We suppose that the growth rate of cells in each lineage is modeled by the BAR process
defined in Eq. (1) and observed through the two-type GW process defined in Section 3.2. The
experiments were independent and conducted under the same conditions, which corresponds to
independence and identical distribution of the processes $(X_{(j,\cdot)}, \delta_{(j,\cdot)})$,
$1 \le j \le m$.

We first give the point and interval estimation for the various parameters of the BAR process.
Table 1 gives the estimation $\hat\theta_9$ of $\theta$ with the 95% confidence interval (CI) of each
coefficient together with an estimation of $a_i/(1 - b_i)$. This value is interesting in itself as
$a_i/(1 - b_i)$ is the fixed point of the equation $\mathbb{E}[X_{2k+i}] = a_i + b_i \mathbb{E}[X_k]$.
Thus it corresponds to the asymptotic mean growth rate of the cells in the lineage always inheriting
the new pole from the mother (i = 0) or always inheriting the old pole (i = 1). The confidence
intervals of $b_0$ and $b_1$ show that the non explosion assumption $|b_0| < 1$ and $|b_1| < 1$ is
satisfied. Note that although the number of observed generations n = 9 may seem too small to obtain
the consistency of our estimators, Theorem 4.2 shows that their standard deviation is of order
$|\mathbb{T}_n^*|^{-1/2}$. Here the total number of observed cells is high enough as
$|\mathbb{T}_9^*| = 22394$. In addition, an empirical study of the convergence rate on simulated data
is conducted in the next section to validate that 9 observed generations is enough.



parameter         estimation   CI                    parameter         estimation   CI

$a_0$             0.0203       [0.0202; 0.0204]      $a_1$             0.0195       [0.0194; 0.0196]

$b_0$             0.4615       [0.4417; 0.4812]      $b_1$             0.4782       [0.4631; 0.4933]

$a_0/(1 - b_0)$   0.03773      [0.03756; 0.03790]    $a_1/(1 - b_1)$   0.03734      [0.03717; 0.03752]



Table 1: Estimation and 95% CI of $\theta$ and $a_i/(1 - b_i)$.
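
The fixed-point interpretation can be checked by a short computation. The following worked equation
is a sketch: it writes $\mu_n$ (a symbol introduced here for illustration) for the mean growth rate
at generation n along the lineage that always inherits the pole of type i, and plugs in the rounded
estimates of Table 1, so the last digits differ slightly from the tabulated values, which are
presumably computed from unrounded estimates.

```latex
% Taking expectations in Eq. (1) along the lineage of type i (centered noise assumed):
\mu_{n+1} = a_i + b_i\,\mu_n
\quad\Longrightarrow\quad
\mu_n - \frac{a_i}{1-b_i} = b_i^{\,n}\Big(\mu_0 - \frac{a_i}{1-b_i}\Big)
\xrightarrow[n\to\infty]{} 0 \quad \text{since } |b_i| < 1 .

% Numerical check with the rounded estimates of Table 1:
\frac{\hat a_0}{1-\hat b_0} = \frac{0.0203}{1-0.4615} = \frac{0.0203}{0.5385} \approx 0.0377,
\qquad
\frac{\hat a_1}{1-\hat b_1} = \frac{0.0195}{1-0.4782} = \frac{0.0195}{0.5218} \approx 0.0374 .
```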

Table 2 gives the estimations $\hat\sigma_{i,9}^2$ of $\sigma_i^2$ and $\hat\rho_9$ of $\rho$ with
the 95% CI of each coefficient. The hypothesis of equality of variances $\sigma_0^2 = \sigma_1^2$ is
not rejected (p-value = 0.19). From the biological point of view, this result is not surprising as the
noise sequence represents the influence of the environment and both sister cells are born and grow in
the same local environment.

We now turn to the results of symmetry tests. The hypothesis of equality of the couples
$(a_0, b_0) = (a_1, b_1)$ is strongly rejected (p-value = 10^^). The hypothesis of the equality of the
two fixed points $a_0/(1 - b_0)$ and $a_1/(1 - b_1)$ of the BAR process is also rejected
(p-value = 2-10^^). We can therefore rigorously confirm that there is a statistically significant
asymmetry in the division of E. coli. Biologically we can thus conclude that the growth rates of the
old pole and new pole cells do have different dynamics. This is interpreted as aging for the single
cell organism E. coli, see Stewart et al (2005); Wang et al (2010).



5.2 Symmetry of the GW process 

Let us now turn to the asymmetry of the GW process itself. Note that, to the best of our knowledge,
it is the first time this question is investigated for the E. coli data of Stewart et al (2005). We
estimated the parameters $p^{(i)}(l_0, l_1)$ of the reproduction laws of the underlying GW process.
Table 3 gives the estimations $\hat p_9^{(i)}(l_0, l_1)$ of the $p^{(i)}(l_0, l_1)$. The estimation
of the dominant eigenvalue $\pi$ of the descendants matrix of the GW processes (characterizing
extinction, see A.1) is $\hat\pi_9 = 1.204$ with CI [1.191; 1.217]. The non-extinction hypothesis
($\pi > 1$) is thus satisfied.

The means of the two reproduction laws $p^{(0)}$ and $p^{(1)}$ are estimated at $\hat m_0 = 1.2048$
and $\hat m_1 = 1.2032$ respectively. The hypothesis of the equality of the mean numbers of offspring
is not rejected (p-value = 0.9). However, Table 3 shows that there is a statistically significant
difference between the vectors $p^{(0)}$ and $p^{(1)}$ as none of the confidence intervals intersect.
Indeed, the symmetry hypothesis $p^{(0)} = p^{(1)}$ is rejected with p-value = 2 • 10^^. However, it
is not possible to interpret this asymmetry in terms of the division of E. coli, since the missing
data are mostly due to observation difficulties rather than some intrinsic behavior of the cells.
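
The reported means and dominant eigenvalue can be recomputed from the estimates of Table 3. The
sketch below assumes the usual convention for the descendants (mean offspring) matrix, whose entry
(i, l) is the expected number of type-l daughters of a type-i mother; the paper defines this matrix
in its appendix, so the convention and the function names used here are assumptions.

```python
import numpy as np

# Estimated reproduction laws from Table 3, ordered as (1,1), (1,0), (0,1), (0,0).
p_hat = {
    0: np.array([0.56060, 0.03621, 0.04740, 0.35579]),
    1: np.array([0.55928, 0.04707, 0.03755, 0.35611]),
}

def mean_offspring(p):
    """Expected number of observed daughters: 2 p(1,1) + p(1,0) + p(0,1)."""
    return 2 * p[0] + p[1] + p[2]

def descendants_matrix(p0, p1):
    """Entry (i, l): expected number of type-l daughters of a type-i mother."""
    return np.array([[p0[0] + p0[1], p0[0] + p0[2]],
                     [p1[0] + p1[1], p1[0] + p1[2]]])

m_hat = {i: mean_offspring(p) for i, p in p_hat.items()}
eigenvalues = np.linalg.eigvals(descendants_matrix(p_hat[0], p_hat[1]))
pi_hat = eigenvalues.real.max()

print("mean offspring:", m_hat)
print("dominant eigenvalue:", pi_hat)
```

With these inputs the means agree with the reported 1.2048 and 1.2032 and the dominant eigenvalue
with the reported 1.204, up to rounding.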

parameter      estimation             CI

$\sigma_0^2$   $2.28 \cdot 10^{-5}$   $[0.88 \cdot 10^{-5};\ 3.67 \cdot 10^{-5}]$

$\sigma_1^2$   $1.34 \cdot 10^{-5}$   $[1.29 \cdot 10^{-5};\ 1.40 \cdot 10^{-5}]$

$\rho$         $0.48 \cdot 10^{-5}$   $[0.44 \cdot 10^{-5};\ 0.52 \cdot 10^{-5}]$



Table 2: Estimation and 95% CI of $\sigma_i^2$ and $\rho$.



parameter        estimation   CI                    parameter        estimation   CI

$p^{(0)}(1,1)$   0.56060      [0.56055; 0.56065]    $p^{(1)}(1,1)$   0.55928      [0.55923; 0.55933]

$p^{(0)}(1,0)$   0.03621      [0.03620; 0.03622]    $p^{(1)}(1,0)$   0.04707      [0.04706; 0.04708]

$p^{(0)}(0,1)$   0.04740      [0.04739; 0.04741]    $p^{(1)}(0,1)$   0.03755      [0.03754; 0.03756]

$p^{(0)}(0,0)$   0.35579      [0.35574; 0.35583]    $p^{(1)}(0,0)$   0.35611      [0.35606; 0.35616]



Table 3: Estimation and 95% CI of $p$.



6 Simulation study 

To investigate the empirical rate of convergence of our estimators as well as the power of the 
symmetry tests we have performed simulations of our coupled BAR-GW model. In particular, 
we study how they depend both on the ratio of missing data and on the number of observed 
generations. 

In a complete binary tree, the number of descendants of each individual is exactly 2. In our
model of GW tree, the number of descendants is random and its average is asymptotically of
the order of the dominant eigenvalue $\pi$ of the descendants matrix of the GW processes, see A.1.
Therefore $\pi$ characterizes the scarcity of data: if $\pi = 2$, the whole tree is observed and
there are no missing data; as $\pi$ decreases, the average number of missing data increases (we
choose $\pi > 1$ to avoid almost sure extinction). In addition, for a single GW tree, the number of
observed individuals up to generation n is asymptotically of order $\pi^n$.

We have simulated the BAR-GW process for 19 distinct parameter sets, see Tables 4 and 5.
Sets 1 to 10 are symmetric with decreasing $\pi$ (from 2 to 1.08), sets 11 to 19 are asymmetric
with decreasing $\pi$ (from 1.9 to 1.1). The parameters of the BAR process are chosen close to the
estimated values on E. coli data whereas the GW parameters are chosen to obtain different values
of $\pi$. Notice that set 18 is close to the estimated values for E. coli data. For each set, we
simulated the BAR-GW process up to generation 15 and ran our estimation procedure on m = 100
replicated trees (m = 94 for E. coli data). Each estimation was repeated 1000 times.



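set        $a_0$    $b_0$    $a_1$    $b_1$    $\sigma_0^2$           $\sigma_1^2$           $\rho$

1 to 10    0.02     0.47     0.02     0.47     $1.8 \cdot 10^{-5}$    $1.8 \cdot 10^{-5}$    $0.5 \cdot 10^{-5}$

11 to 19   0.0203   0.4615   0.0195   0.4782   $2.28 \cdot 10^{-5}$   $1.34 \cdot 10^{-5}$   $0.48 \cdot 10^{-5}$



Table 4: Parameter sets for the simulated BAR processes.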



We first investigate the significance level and power of our symmetry tests on the simulated data.
The asymptotic properties of the tests are given in the appendix. Table 6 (resp. Table 7) gives the
proportion of rejections (significance level 5%) under H0 (symmetric sets 1 to 10) and under H1
(asymmetric sets 11 to 19) for the test of symmetry of fixed points H0: $a_0/(1 - b_0) = a_1/(1 - b_1)$
(resp. the test of equality of vectors H0: $(a_0, b_0) = (a_1, b_1)$). In both cases, the proportion
of rejections under H0 is close to the significance level regardless of the number of observed
generations (from 5 generations on) and of the value of $\pi$. We can thus conclude that from n = 5 on
the asymptotic law is valid. Under H1, the proportion of rejections increases when the number of
observed generations increases and decreases when $\pi$ decreases. Recall that the number of observed
individuals up to generation n is asymptotically of order $m\pi^n$ (m = 100) and the power is strongly
linked to the number of observed data. For instance, it is perfect for high numbers of observed
generations and high $\pi$, when the expected number of observed data is huge, and it is low for low
$\pi$ even for high numbers of observed generations.

set   $p^{(0)}$                      $p^{(1)}$                      $\pi$

1     (1, 0, 0, 0)                   (1, 0, 0, 0)                   2

2     (0.90, 0.04, 0.04, 0.02)       (0.90, 0.04, 0.04, 0.02)       1.88

3     (0.85, 0.04, 0.04, 0.07)       (0.85, 0.04, 0.04, 0.07)       1.78

4     (0.80, 0.04, 0.04, 0.12)       (0.80, 0.04, 0.04, 0.12)       1.68

5     (0.75, 0.04, 0.04, 0.17)       (0.75, 0.04, 0.04, 0.17)       1.58

6     (0.70, 0.04, 0.04, 0.22)       (0.70, 0.04, 0.04, 0.22)       1.48

7     (0.65, 0.04, 0.04, 0.27)       (0.65, 0.04, 0.04, 0.27)       1.38

8     (0.60, 0.04, 0.04, 0.32)       (0.60, 0.04, 0.04, 0.32)       1.28

9     (0.55, 0.04, 0.04, 0.37)       (0.55, 0.04, 0.04, 0.37)       1.18

10    (0.50, 0.04, 0.04, 0.42)       (0.50, 0.04, 0.04, 0.42)       1.08

11    (0.901, 0.045, 0.055, 0.019)   (0.899, 0.055, 0.045, 0.021)   1.9

12    (0.851, 0.045, 0.055, 0.069)   (0.849, 0.055, 0.045, 0.071)   1.8

13    (0.801, 0.045, 0.055, 0.119)   (0.799, 0.055, 0.045, 0.121)   1.7

14    (0.751, 0.045, 0.055, 0.169)   (0.749, 0.055, 0.045, 0.171)   1.6

15    (0.701, 0.045, 0.055, 0.219)   (0.699, 0.055, 0.045, 0.221)   1.5

16    (0.651, 0.045, 0.055, 0.269)   (0.649, 0.055, 0.045, 0.271)   1.4

17    (0.601, 0.045, 0.055, 0.319)   (0.659, 0.055, 0.045, 0.321)   1.3

18    (0.551, 0.045, 0.055, 0.369)   (0.549, 0.055, 0.045, 0.371)   1.2

19    (0.501, 0.045, 0.055, 0.419)   (0.499, 0.055, 0.045, 0.421)   1.1



Table 5: Parameter sets for the simulated GW processes.
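
The link between $\pi$ and the amount of data can be made concrete for the symmetric sets. When both
types share the same reproduction law, the rows of the descendants matrix coincide and its dominant
eigenvalue reduces to the mean number of observed daughters, 2p(1,1) + p(1,0) + p(0,1). The sketch
below recomputes $\pi$ for three symmetric sets of Table 5 and evaluates the order of magnitude
$m\pi^n$; the function names are illustrative, and $m\pi^n$ is only an asymptotic order, not an
exact expected count.

```python
def mean_offspring(p):
    """Mean number of observed daughters for a law ordered (1,1), (1,0), (0,1), (0,0)."""
    p11, p10, p01, _p00 = p
    return 2 * p11 + p10 + p01

def order_of_sample_size(m, pi, n):
    """Asymptotic order m * pi**n of the number of observed cells up to generation n."""
    return m * pi ** n

# Symmetric parameter sets 1, 2 and 10 of Table 5.
symmetric_sets = {
    1: (1.0, 0.0, 0.0, 0.0),
    2: (0.90, 0.04, 0.04, 0.02),
    10: (0.50, 0.04, 0.04, 0.42),
}
pi = {s: mean_offspring(p) for s, p in symmetric_sets.items()}

for s, value in pi.items():
    for n in (9, 15):
        size = order_of_sample_size(100, value, n)
        print(f"set {s}: pi = {value:.2f}, n = {n}, m * pi^n ~ {size:.0f}")
```

The gap of several orders of magnitude between high and low $\pi$ at a fixed number of generations is
consistent with the loss of power observed for the scarcest sets.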

Next, we investigate the empirical convergence rate of the estimation error
$\|\hat\theta_n - \theta\|_2$ both as a function of the number of observed generations n and of
$\pi$. Figure 2 (resp. Figure 3) shows the distribution of $\|\hat\theta_n - \theta\|_2 / \|\theta\|_2$
for n = 9 (resp. n = 15) observed generations for the asymmetric parameter sets 11 to 19. It
illustrates how the error deteriorates as $\pi$ decreases, i.e. as the ratio of missing data
increases. The two figures have the same scale to illustrate how the relative error decreases when
the number of observed generations is higher.



We know from Theorem 4.2 that the standard deviation of $\hat\theta_n$ is of order
$|\mathbb{T}_n^*|^{-1/2}$, which asymptotically has the same order of magnitude as $\pi^{-n/2}$. In
order to check how soon (in terms of the number n of observed generations) this asymptotic rate is
reached, we fitted the logarithm of the errors $\|\hat\theta_n - \theta\|_2$ (averaged over the 1000
simulations) to a linear function of n for each parameter set (using the errors from generation 8 to
generation 15). The results are shown in Figure 4.

We also compare the computed slopes of the linear functions to the theoretical value $-\log(\pi)/2$
for the various parameter sets. The results are given in Table 8 and show that the asymptotic
rate is reached from generation 8 on. It thus validates the accuracy of the study of E. coli data
conducted in the previous section.
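
The slope comparison amounts to a linear fit of the log-errors against the generation number. The
sketch below shows the computation on synthetic errors that decay exactly like $\pi^{-n/2}$; these
errors and the constant 0.05 are hypothetical placeholders, not results from the paper, and the
function name is chosen for illustration.

```python
import numpy as np

def fitted_slope(generations, mean_errors):
    """Slope of the least-squares line fitted to log(mean error) versus n."""
    slope, _intercept = np.polyfit(generations, np.log(mean_errors), 1)
    return slope

pi = 1.2
generations = np.arange(8, 16)                       # generations 8 to 15
synthetic_errors = 0.05 * pi ** (-generations / 2)   # hypothetical errors

slope = fitted_slope(generations, synthetic_errors)
theoretical = -np.log(pi) / 2
print("fitted slope     :", slope)
print("theoretical slope:", theoretical)
```

On real simulation output the fitted slope would only approximate the theoretical value, and the
size of the discrepancy indicates how far the estimator is from its asymptotic regime.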



7 Conclusion 

In this paper, we first propose a statistical model to estimate and test asymmetry of a quantitative
characteristic associated with each node of a family of incomplete binary trees, without aggregating
single-tree estimators. An immediate application is the investigation of asymmetry in cell lineage
data. This model of coupled GW-BAR process generalizes all the previous methods on this subject
in the literature because it rigorously takes into account:

• the dependence of the characteristic of a cell on that of its mother and the correlation between
two sisters through the BAR model,

• the possibly missing data through the GW model,

generation 5 6 7 8 9 10 11 12 13 14 15

set 1 0.037 0.050 0.047 0.048 0.046 0.056 0.046 0.047 0.053 0.041 0.042

set 2 0.045 0.047 0.047 0.052 0.048 0.053 0.050 0.042 0.040 0.050 0.049 

set 3 0.051 0.048 0.043 0.048 0.057 0.064 0.046 0.045 0.048 0.049 0.052 

set 4 0.051 0.055 0.052 0.056 0.049 0.047 0.052 0.050 0.059 0.058 0.051 

set 5 0.052 0.052 0.049 0.053 0.061 0.065 0.052 0.054 0.040 0.045 0.042 

set 6 0.045 0.036 0.039 0.035 0.051 0.062 0.054 0.061 0.055 0.043 0.046 

set 7 0.045 0.048 0.045 0.044 0.048 0.037 0.041 0.044 0.050 0.049 0.049 

set 8 0.046 0.044 0.044 0.049 0.047 0.048 0.042 0.038 0.043 0.043 0.054 

set 9 0.053 0.052 0.058 0.061 0.060 0.055 0.052 0.052 0.045 0.053 0.051 

set 10 0.039 0.038 0.051 0.046 0.054 0.049 0.054 0.046 0.047 0.046 0.039 

set 11 0.448 0.697 0.926 0.995 1.000 1.000 1.000 1.000 1.000 1.000 1.000 

set 12 0.356 0.568 0.832 0.975 0.999 1.000 1.000 1.000 1.000 1.000 1.000 

set 13 0.305 0.497 0.711 0.894 0.991 1.000 1.000 1.000 1.000 1.000 1.000 

set 14 0.252 0.399 0.586 0.777 0.926 0.994 0.999 1.000 1.000 1.000 1.000 

set 15 0.208 0.293 0.417 0.608 0.808 0.930 0.990 1.000 1.000 1.000 1.000 

set 16 0.200 0.279 0.390 0.502 0.668 0.790 0.905 0.977 0.997 1.000 1.000 

set 17 0.174 0.234 0.287 0.364 0.458 0.566 0.696 0.829 0.912 0.967 0.990 

set 18 0.130 0.165 0.209 0.255 0.335 0.382 0.451 0.548 0.650 0.725 0.811 

set 19 0.118 0.142 0.174 0.190 0.207 0.245 0.300 0.330 0.371 0.416 0.459 



Table 6: Proportion of p-values < 5% for the equality of fixed points test (1000 replications).



generation 5 6 7 8 9 10 11 12 13 14 15 

sen 0M5 (L062 038 OMl OMl OMl OM) (L033 060 (L036 (L049 

set 2 0.036 0.055 0.049 0.054 0.044 0.048 0.032 0.037 0.039 0.047 0.041 

set 3 0.040 0.044 0.045 0.053 0.057 0.042 0.050 0.039 0.053 0.045 0.039 

set 4 0.053 0.058 0.055 0.047 0.053 0.056 0.061 0.049 0.052 0.048 0.043 

set 5 0.050 0.050 0.049 0.052 0.056 0.049 0.047 0.052 0.044 0.048 0.044 

set 6 0.058 0.043 0.040 0.043 0.052 0.053 0.057 0.056 0.048 0.043 0.051 

set 7 0.032 0.048 0.042 0.032 0.044 0.040 0.046 0.035 0.041 0.052 0.047 

set 8 0.059 0.052 0.058 0.055 0.052 0.050 0.053 0.044 0.050 0.052 0.050 

set 9 0.054 0.049 0.046 0.042 0.048 0.042 0.044 0.050 0.042 0.047 0.045 

set 10 0.042 0.049 0.045 0.044 0.045 0.053 0.051 0.046 0.043 0.044 0.037 

set 11 0.414 0.678 0.920 0.998 1.000 1.000 1.000 1.000 1.000 1.000 1.000 

set 12 0.310 0.557 0.833 0.980 0.999 1.000 1.000 1.000 1.000 1.000 1.000 

set 13 0.286 0.454 0.703 0.902 0.996 1.000 1.000 1.000 1.000 1.000 1.000 

set 14 0.218 0.367 0.555 0.775 0.938 0.995 1.000 1.000 1.000 1.000 1.000 

set 15 0.193 0.276 0.391 0.596 0.789 0.934 0.990 1.000 1.000 1.000 1.000 

set 16 0.175 0.237 0.354 0.479 0.641 0.800 0.925 0.980 0.997 1.000 1.000 

set 17 0.156 0.188 0.246 0.362 0.437 0.540 0.683 0.806 0.919 0.968 0.989 

set 18 0.126 0.152 0.193 0.247 0.285 0.359 0.410 0.525 0.633 0.726 0.819 

set 19 0.110 0.116 0.140 0.161 0.192 0.229 0.271 0.320 0.365 0.395 0.452 



Table 7: Proportion of p-values < 5% for the test $(a_0, b_0) = (a_1, b_1)$ (1000 replications).
Figure 2: Boxplot of the estimation of the relative error $\|\hat\theta - \theta\|_2/\|\theta\|_2$ for the data sets 11 to 19
(decreasing $\pi$)

Figure 3: Boxplot of the estimation of the relative error $\|\hat\theta - \theta\|_2/\|\theta\|_2$ for the data sets 11 to 19
(decreasing $\pi$)




Figure 4: Logarithm of the error $\log\|\hat\theta_n - \theta\|_2$ as a function of the number $n$ of observed generations
for the asymmetric parameter sets 11 to 19 (from bottom to top: set 11-black circles, set 12-blue
squares, set 13-magenta diamonds, set 14-red triangles, set 15-black squares, set 16-blue circles,
set 17-magenta triangles, set 18-red diamonds, set 19-black stars).

set 11 12 13 14 15 16 17 18 19

empirical slope -0.3170 -0.2966 -0.2634 -0.2325 -0.2060 -0.1801 -0.1413 -0.0953 -0.0672
$-\log(\pi)/2$ -0.3209 -0.2939 -0.2653 -0.2350 -0.2027 -0.1682 -0.1312 -0.0912 -0.0477

Table 8: Logarithm of empirical convergence rates vs theoretical rate



• the information from several sets of data obtained in similar experimental conditions without 
the drawbacks of poor accuracy or power for small single-trees. 

Furthermore, we propose the estimation of parameters of a two-type GW process in the specific
context of a binary tree with a fine observation, namely the presence or absence of each cell of 
the complete binary tree is known. In the context where missing offspring really come from the 
intrinsic reproduction, and not from faulty measures, the asymmetry of the parameters of the GW 
process can be applied to cell lineage data and be interpreted as a difference in the reproduction 
laws between the two different types of cell. 



We applied our procedure to the E. coli data of Stewart et al (2005) and concluded that there exists
a statistically significant asymmetry in this cell division. Results were validated by simulation
studies of the empirical rate of convergence of the estimators and power of the tests.



A Technical assumptions and notation 



Our convergence results rely on martingale theory and the use of several carefully chosen filtrations
regarding the BAR and/or GW process. The approach is similar to that of de Saporta et al (2011);
de Saporta et al (2012), but their results cannot be directly applied here. This is mainly due to our
choice of the global non-extinction set as the union and not the intersection of the non-extinction
sets of each replicated process, which prevents us from directly using convergence results on single-tree
estimators. We now give some additional notation and the precise assumptions of our convergence
theorems.
A.1 Generations and extinction



We first introduce some notation about the complete and observed genealogy trees that will be used
in the sequel. For all $n \geq 1$, denote the $n$-th generation of any given tree by
$\mathbb{G}_n = \{k,\ 2^n \leq k \leq 2^{n+1} - 1\}$. In particular, $\mathbb{G}_0 = \{1\}$ is the initial generation, and $\mathbb{G}_1 = \{2, 3\}$ is the first generation
of offspring from the first ancestor. Denote by $\mathbb{T}_n = \bigcup_{\ell=0}^{n} \mathbb{G}_\ell$ the sub-tree of all individuals from
the original individual up to the $n$-th generation. Note that the cardinality $|\mathbb{G}_n|$ of $\mathbb{G}_n$ is $2^n$, while
that of $\mathbb{T}_n$ is $|\mathbb{T}_n| = 2^{n+1} - 1$. Finally, we define the sets of observed individuals in each tree
$\mathbb{G}^*_{j,n} = \{k \in \mathbb{G}_n : \delta_{(j,k)} = 1\}$ and $\mathbb{T}^*_{j,n} = \{k \in \mathbb{T}_n : \delta_{(j,k)} = 1\}$, and set

$$|\mathbb{G}^*_n| = \sum_{j=1}^{m} |\mathbb{G}^*_{j,n}| \quad\text{and}\quad |\mathbb{T}^*_n| = \sum_{j=1}^{m} |\mathbb{T}^*_{j,n}|,$$

the total number of observed cells in all $m$ trees in generation $n$ and up to generation $n$ respectively.
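The indexing conventions above can be illustrated with a short sketch. The helper names and the dictionary representation of the indicators $\delta_{(j,k)}$ are assumptions of this example, not part of the paper.

```python
def generation(n):
    """Indices of the n-th generation G_n = {2^n, ..., 2^(n+1) - 1}."""
    return range(2 ** n, 2 ** (n + 1))

def subtree(n):
    """Indices of T_n, i.e. all cells from the root up to generation n."""
    return range(1, 2 ** (n + 1))

def mother(k):
    """Index of the mother of cell k (cells 2k and 2k+1 descend from k)."""
    return k // 2

def cell_type(k):
    """Type of cell k: 0 for an even index (2k'), 1 for an odd index (2k'+1)."""
    return k % 2

def observed_counts(delta, n):
    """Return (|G*_n|, |T*_n|) for one tree.

    delta is a dict mapping a cell index to 1 if the cell is observed;
    missing keys are treated as unobserved cells.
    """
    in_generation = sum(delta.get(k, 0) for k in generation(n))
    in_subtree = sum(delta.get(k, 0) for k in subtree(n))
    return in_generation, in_subtree
```

For several trees, the totals $|\mathbb{G}^*_n|$ and $|\mathbb{T}^*_n|$ are obtained by summing the outputs of `observed_counts` over the $m$ trees.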
We next need to characterize the possible extinction of the GW processes, that is where $|\mathbb{T}^*_n|$ does
not tend to infinity with $n$. For $1 \leq j \leq m$ and $n \geq 1$, we define the number of observed cells
among the $n$-th generation of the $j$-th tree, distinguishing according to their type, by

$$Z^0_{j,n} = \sum_{k \in \mathbb{G}_{n-1}} \delta_{(j,2k)} \quad\text{and}\quad Z^1_{j,n} = \sum_{k \in \mathbb{G}_{n-1}} \delta_{(j,2k+1)},$$

and we set $Z_{j,n} = (Z^0_{j,n}, Z^1_{j,n})$. For all $j$, the process $(Z_{j,n})$ thus defined is a two-type GW process,
see Harris (1963). We define the descendants matrix $P$ of the GW process by

$$P = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix},$$

where $p_{i0} = p^{(i)}(1,0) + p^{(i)}(1,1)$ and $p_{i1} = p^{(i)}(0,1) + p^{(i)}(1,1)$, for $i \in \{0,1\}$. The quantity $p_{il}$ is
thus the expected number of descendants of type $l$ of an individual of type $i$. It is well-known that
when all the entries of the matrix $P$ are positive, $P$ has a positive strictly dominant eigenvalue,
denoted $\pi$, which is also simple and admits a positive left eigenvector, see e.g. (Harris 1963,
Theorem 5.1). In that case, we denote by $z = (z^0, z^1)$ the left eigenvector of $P$ associated with
the dominant eigenvalue $\pi$ and satisfying $z^0 + z^1 = 1$. Let $\mathcal{E}_j = \bigcup_{n \geq 1} \{Z_{j,n} = (0,0)\}$ be the event
corresponding to the case when there are no cells left to observe in the $j$-th tree. We will denote
$\overline{\mathcal{E}}_j$ the complementary set of $\mathcal{E}_j$. We are interested in asymptotic results on the set where there is
an infinity of cells to be observed, that is on the union of the non-extinction sets $\overline{\mathcal{E}}_j$ denoted by

$$\overline{\mathcal{E}} = \bigcup_{j=1}^{m} \overline{\mathcal{E}}_j = \Big\{\lim_{n\to\infty} |\mathbb{T}^*_n| = \infty\Big\}.$$


Note that we allow some trees to become extinct, as long as there is at least one tree still growing. This
assumption is natural in view of the E. coli data as the collected genealogies have significantly
different numbers of observed generations (from 4 up to 9).
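As an illustration of the descendants matrix $P$, its dominant eigenvalue $\pi$ and the normalized left eigenvector $z$, here is a minimal sketch using power iteration. The reproduction probabilities below are hypothetical values chosen for the example, and the function names are not from the paper.

```python
import math

def descendants_matrix(p0, p1):
    """Build P from reproduction laws.

    p0 and p1 map (l0, l1) to the probability that a mother of type 0
    (resp. 1) has l0 observed daughters of type 0 and l1 of type 1.
    """
    return [[p[(1, 0)] + p[(1, 1)], p[(0, 1)] + p[(1, 1)]] for p in (p0, p1)]

def dominant_left_eigen(P, iterations=500):
    """Power iteration on z -> zP, keeping z normalized so that z0 + z1 = 1."""
    z = [0.5, 0.5]
    pi = 1.0
    for _ in range(iterations):
        w = [z[0] * P[0][0] + z[1] * P[1][0],
             z[0] * P[0][1] + z[1] * P[1][1]]
        pi = w[0] + w[1]          # z sums to one, so the sum of zP estimates pi
        z = [w[0] / pi, w[1] / pi]
    return pi, z

# Hypothetical reproduction probabilities (illustrative only).
p0 = {(1, 1): 0.8, (1, 0): 0.1, (0, 1): 0.05, (0, 0): 0.05}
p1 = {(1, 1): 0.7, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.1}
P = descendants_matrix(p0, p1)
pi, z = dominant_left_eigen(P)
```

Power iteration is used only to keep the sketch dependency-free; for a $2 \times 2$ matrix the eigenvalue can equally be obtained from the trace and determinant.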



A.2 Assumptions

Our inference is based on the $m$ i.i.d. replicas of the observed BAR process, i.e. the available
information is given by the sequence $(\delta_{(j,k)}, \delta_{(j,k)} X_{(j,k)})_{1 \leq j \leq m,\ k \geq 1}$. We first introduce the natural
generation-wise filtrations of the BAR processes. For all $1 \leq j \leq m$, denote by $\mathbb{F}_j = (\mathcal{F}_{j,n})_{n \geq 1}$ the
natural filtration associated with the $j$-th copy of the BAR process, which means that $\mathcal{F}_{j,n}$ is the
$\sigma$-algebra generated by all individuals of the $j$-th tree up to the $n$-th generation, $\mathcal{F}_{j,n} = \sigma\{X_{(j,k)},\ k \in \mathbb{T}_n\}$.
For all $1 \leq j \leq m$, we also define the observation filtrations as $\mathcal{O}_{j,n} = \sigma\{\delta_{(j,k)},\ k \in \mathbb{T}_n\}$, and
the sigma fields $\mathcal{O}_j = \sigma\{\delta_{(j,k)},\ k \geq 1\}$.

We make the following main assumptions on the BAR and GW processes. 

(H.0) The parameters $(a_0, b_0, a_1, b_1)$ satisfy the usual stability assumption $0 < \max\{|b_0|, |b_1|\} < 1$.
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(H.l) For all 1 < j < rn, n > 0, fc e G„+i, E[e(|fc)] < oo and E[X(^/i)] < oo. 
For all 1 < j < m. n > 0, fc e G„ and i e {0, 1}, one a.s. has 

For all 1 < j < m, n > 0, fc e G„, one a.s. has 

E[e(j;2fe)£0\2fe+l)|-^j\«] = P, Hl^'fj,2kffj,2k+l)\^3,n] = 1^^, H^fj,2kflj,2k+l)\^J,n] = V^, 
H^U.2kfU,2k+l)\J^j,n] a, E[e(j,2fe)e(j,2fc+l)l-^j\n] = 

(H.2) For all $1 \leq j \leq m$ and $n \geq 0$ the vectors $\{(\varepsilon_{(j,2k)}, \varepsilon_{(j,2k+1)}),\ k \in \mathbb{G}_n\}$ are conditionally
independent given $\mathcal{F}_{j,n}$.

(H.3) The sequences $(\varepsilon_{(1,k)})_{k \geq 2}, (\varepsilon_{(2,k)})_{k \geq 2}, \ldots, (\varepsilon_{(m,k)})_{k \geq 2}$ are independent. The random
variables $(X_{(j,1)})_{1 \leq j \leq m}$ are independent and independent from the noise sequences.

(H.4) For all $1 \leq j \leq m$, the sequence $(\delta_{(j,k)})_{k \geq 1}$ is independent from the sequences $(X_{(j,k)})_{k \geq 1}$
and $(\varepsilon_{(j,k)})_{k \geq 2}$.

(H.5) The sequences $(\delta_{(1,k)})_{k \geq 2}, (\delta_{(2,k)})_{k \geq 2}, \ldots, (\delta_{(m,k)})_{k \geq 2}$ are independent.

We also make the following super-criticality assumption on the matrix $P$.

(H.6) All entries of the matrix $P$ are positive: for all $(i, l) \in \{0,1\}^2$, $p_{il} > 0$, and the dominant
eigenvalue is greater than one: $\pi > 1$.

If $\pi > 1$, it is well known, see e.g. Harris (1963), that the extinction probability of the GW
processes is less than one: for all $1 \leq j \leq m$, $P(\mathcal{E}_j) = p < 1$. Under assumptions (H.5-6), one thus
clearly has $P(\overline{\mathcal{E}}) = 1 - p^m > 0$.
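The extinction probability can be approximated numerically by the classical fixed-point iteration on the offspring generating functions of a multitype GW process. The sketch below is only an illustration under stated assumptions: the reproduction probabilities are hypothetical, the function names are not from the paper, and the iteration returns one extinction probability per type of initial cell, whereas the text writes a single value $p$.

```python
def extinction_probabilities(p0, p1, iterations=10_000):
    """Iterate q -> f(q) from q = (0, 0), where
    f_i(s0, s1) = sum over (l0, l1) of p^(i)(l0, l1) * s0^l0 * s1^l1.
    The limit is the smallest fixed point, i.e. the extinction
    probability of a lineage started from one cell of type i.
    """
    q = [0.0, 0.0]
    for _ in range(iterations):
        q = [sum(prob * q[0] ** l0 * q[1] ** l1 for (l0, l1), prob in p.items())
             for p in (p0, p1)]
    return q

def prob_at_least_one_tree_survives(p_ext, m):
    """P(union of non-extinction sets) = 1 - p^m for m independent trees."""
    return 1.0 - p_ext ** m

# Hypothetical reproduction probabilities (illustrative only).
p0 = {(1, 1): 0.8, (1, 0): 0.1, (0, 1): 0.05, (0, 0): 0.05}
p1 = {(1, 1): 0.7, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.1}
q = extinction_probabilities(p0, p1)
```

Even with a moderate single-tree extinction probability, the probability that all $m$ trees die out decreases geometrically in $m$, which is what makes the union of the non-extinction sets a convenient working set.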



Note that under these assumptions, it is proved in de Saporta et al (2011) that the single-tree
estimators $\hat\theta_{j,n}$ are consistent on the single-tree non-extinction sets $\overline{\mathcal{E}}_j$. This result is based on the
separate convergence of $S_{j,n}$ and $S_{j,n}\hat\theta_{j,n}$. Therefore, the convergence of our global estimator $\hat\theta_n$
is readily obtained on the intersection of the single-tree non-extinction sets $\bigcap_{j=1}^{m} \overline{\mathcal{E}}_j$, see Section 2.
However, we are interested in the convergence of the global estimator on the larger set $\overline{\mathcal{E}} = \bigcup_{j=1}^{m} \overline{\mathcal{E}}_j$.
This is why we cannot directly use the results of de Saporta et al (2011). We explain in the following
sections how the ideas therein have to be adapted to this new framework.

A.3 Additional estimators

From the estimators of the reproduction probabilities of the GW process, one can easily construct
an estimator of the spectral radius $\pi$ of the descendants matrix $P$ of the GW process. Indeed, $P$ is
a $2 \times 2$ matrix so that its spectral radius can be computed explicitly as a function of its coefficients,
namely

$$\pi = \frac{\mathrm{tr}(P) + (\mathrm{tr}(P)^2 - 4\det(P))^{1/2}}{2}.$$

Replacing the coefficients of $P$ by their empirical estimators, one obtains

$$\hat\pi_n = \frac{1}{2}\big(\hat T_n + (\hat T_n^2 - 4\hat D_n)^{1/2}\big),$$

where

$$\hat T_n = \hat p^{(0)}_n(1,0) + \hat p^{(0)}_n(1,1) + \hat p^{(1)}_n(0,1) + \hat p^{(1)}_n(1,1),$$

$$\hat D_n = \big(\hat p^{(0)}_n(1,0) + \hat p^{(0)}_n(1,1)\big)\big(\hat p^{(1)}_n(0,1) + \hat p^{(1)}_n(1,1)\big) - \big(\hat p^{(0)}_n(0,1) + \hat p^{(0)}_n(1,1)\big)\big(\hat p^{(1)}_n(1,0) + \hat p^{(1)}_n(1,1)\big),$$

are the empirical estimators of the trace $\mathrm{tr}(P)$ and the determinant $\det(P)$ respectively. Finally,
to compute confidence intervals for $\sigma_i^2$ and $\rho$, we need an estimation of higher moments. We use
again empirical estimators

"^i," ^ ■pfpii ^(j,2fc+i)' '^n = irn^^oi I / , 2^ ^(j,2kf{j,2k+l)- 

B Convergence of estimators for the GW process



We now prove the convergence of the estimators for the GW process, that is the first parts of
Theorems 4.1 and 4.2, together with additional technical results.



B.1 Preliminary results: from single-trees to multiple trees



Our objective is to show that we can adapt the results in de Saporta et al (2011) to the multiple tree
framework despite our choice of considering the union and not the intersection of the single-tree
non-extinction sets. To this aim, we first need to recall Lemma A.3 of Bercu et al (2009).



Lemma B.1 Let $(A_n)$ be a sequence of real-valued matrices such that

$$\sum_{n=0}^{\infty} \|A_n\| < \infty \quad\text{and}\quad \lim_{n\to\infty} \sum_{k=0}^{n} A_k = A.$$

In addition, let $(X_n)$ be a sequence of real-valued vectors which converges to a limiting value $X$.
Then, one has

$$\lim_{n\to\infty} \sum_{\ell=0}^{n} A_{n-\ell} X_\ell = AX.$$



The next result is an adaptation of Lemma A.2 in Bercu et al (2009) to the GW tree framework.
It gives a correspondence between sums on one generation and sums on the whole tree.

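Lemma B.2 Let $(x_n)$ be a sequence of real numbers and $\pi > 1$. One has

$$\lim_{n\to\infty} \frac{1}{\pi^n} \sum_{k \in \mathbb{T}_n} x_k = x \iff \lim_{n\to\infty} \frac{1}{\pi^n} \sum_{k \in \mathbb{G}_n} x_k = \frac{\pi - 1}{\pi}\, x.$$
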

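Proof: Suppose that $\pi^{-n} \sum_{k \in \mathbb{T}_n} x_k$ converges to $x$. Then one has

$$\frac{1}{\pi^n} \sum_{k \in \mathbb{G}_n} x_k = \frac{1}{\pi^n} \sum_{k \in \mathbb{T}_n} x_k - \frac{1}{\pi}\,\frac{1}{\pi^{n-1}} \sum_{k \in \mathbb{T}_{n-1}} x_k \xrightarrow[n\to\infty]{} x - \frac{1}{\pi}\, x = \frac{\pi - 1}{\pi}\, x.$$

Conversely, if $\pi^{-n} \sum_{k \in \mathbb{G}_n} x_k$ converges to $y$, as $\mathbb{T}_n = \bigcup_{\ell=0}^{n} \mathbb{G}_\ell$, one has

$$\frac{1}{\pi^n} \sum_{k \in \mathbb{T}_n} x_k = \sum_{\ell=0}^{n} \frac{1}{\pi^{n-\ell}}\,\frac{1}{\pi^{\ell}} \sum_{k \in \mathbb{G}_\ell} x_k \xrightarrow[n\to\infty]{} \frac{\pi}{\pi - 1}\, y,$$

using Lemma B.1 with $A_n = \pi^{-n}$ and $X_n = \pi^{-n} \sum_{k \in \mathbb{G}_n} x_k$.

□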
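A small numerical check of the correspondence between generation-wise and tree-wise normalized sums may help fix ideas. The sequence below is synthetic: its generation sums are chosen so that $\pi^{-n}\sum_{k\in\mathbb{G}_n} x_k$ converges to $y = 2$ with $\pi = 1.5$ (both values are assumptions of the example).

```python
def normalised_sums(gen_sums, pi):
    """gen_sums[n] is the sum of x_k over generation n.

    Returns two lists: the generation sums divided by pi^n and the
    cumulative (tree) sums divided by pi^n.
    """
    tree_total = 0.0
    per_generation, per_tree = [], []
    for n, g in enumerate(gen_sums):
        tree_total += g
        per_generation.append(g / pi ** n)
        per_tree.append(tree_total / pi ** n)
    return per_generation, per_tree

pi, y = 1.5, 2.0
# Generation sums behaving like y * pi^n up to a vanishing perturbation.
gen_sums = [y * pi ** n * (1.0 + 1.0 / (n + 1)) for n in range(200)]
per_generation, per_tree = normalised_sums(gen_sums, pi)
# per_generation approaches y, per_tree approaches pi / (pi - 1) * y.
```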



We now adapt Lemma 2.1 of de Saporta et al (2011) to our multiple tree framework.



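Lemma B.3 Under assumptions (H.5-6), there exists a nonnegative random variable $W$ such that
for all sequences $(x_{(1,n)}), \ldots, (x_{(m,n)})$ of real numbers one has a.s.

$$\lim_{n\to\infty} \frac{\mathbf{1}_{\{|\mathbb{G}^*_n| > 0\}}}{|\mathbb{T}^*_n|} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} x_{(j,k)} = x\,\mathbf{1}_{\overline{\mathcal{E}}} \iff \lim_{n\to\infty} \frac{1}{\pi^n} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} x_{(j,k)} = x\,\frac{\pi}{\pi - 1}\, W.$$
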



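Proof: We use a well known property of super-critical GW processes, see e.g. Harris (1963): for
all $j$, there exists a nonnegative random variable $W_j$ such that

$$\lim_{n\to\infty} \frac{|\mathbb{T}^*_{j,n}|}{\pi^n} = \frac{\pi}{\pi - 1}\, W_j \quad\text{a.s.} \qquad (4)$$

and in addition $\{W_j > 0\} = \overline{\mathcal{E}}_j = \lim\{|\mathbb{G}^*_{j,n}| > 0\}$. Therefore, one has

$$\lim_{n\to\infty} \frac{|\mathbb{T}^*_n|}{\pi^n} = \lim_{n\to\infty} \sum_{j=1}^{m} \frac{|\mathbb{T}^*_{j,n}|}{\pi^n} = \frac{\pi}{\pi - 1} \sum_{j=1}^{m} W_j \quad\text{a.s.}$$

The result is obtained by setting $W = \sum_{j=1}^{m} W_j$ and noticing that $\overline{\mathcal{E}} = \bigcup_{j=1}^{m} \overline{\mathcal{E}}_j = \{\sum_{j=1}^{m} W_j > 0\} = \lim\{|\mathbb{G}^*_n| > 0\}$. □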

Finally, the main result of this section is new and explains how convergence results on multiple
trees can be obtained from convergence results on a single-tree. This will allow us to directly use
results from de Saporta et al (2011) throughout the sequel.

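Lemma B.4 Let $(x_{(1,n)}), \ldots, (x_{(m,n)})$ be $m$ sequences of real numbers such that for all $1 \leq j \leq m$
one has the a.s. limit

$$\lim_{n\to\infty} \frac{\mathbf{1}_{\{|\mathbb{G}^*_{j,n}| > 0\}}}{|\mathbb{T}^*_{j,n}|} \sum_{k \in \mathbb{T}_n} x_{(j,k)} = x\,\mathbf{1}_{\overline{\mathcal{E}}_j}, \qquad (5)$$

then under assumptions (H.5-6) one also has

$$\lim_{n\to\infty} \frac{\mathbf{1}_{\{|\mathbb{G}^*_n| > 0\}}}{|\mathbb{T}^*_n|} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} x_{(j,k)} = x\,\mathbf{1}_{\overline{\mathcal{E}}} \quad\text{a.s.}$$

Proof: Equations (4) and (5) yield, for all $j$,

$$\lim_{n\to\infty} \frac{1}{\pi^n} \sum_{k \in \mathbb{T}_n} x_{(j,k)} = x\,\frac{\pi}{\pi - 1}\, W_j \quad\text{a.s.}$$

Summing over $j$, one obtains

$$\lim_{n\to\infty} \frac{1}{\pi^n} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} x_{(j,k)} = x\,\frac{\pi}{\pi - 1}\, W \quad\text{a.s.}$$

Finally, we use Lemma B.3 to conclude. □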

B.2 Strong consistency for the estimators of the GW process 

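To prove the convergence of the $\hat p^{(i)}_n(l_0, l_1)$ we first need to derive a convergence result for a sum
of independent GW processes.

Lemma B.5 Suppose that assumptions (H.5-6) are satisfied. Then for $i \in \{0, 1\}$ one has

$$\lim_{n\to\infty} \mathbf{1}_{\{|\mathbb{G}^*_n| > 0\}}\, |\mathbb{T}^*_n|^{-1} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-1}} \delta_{(j,2k+i)} = z^i\,\mathbf{1}_{\overline{\mathcal{E}}} \quad\text{a.s.}$$

Proof Remarking that $\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-1}} \delta_{(j,2k+i)} = \sum_{j=1}^{m} \sum_{\ell=1}^{n} Z^i_{j,\ell}$, the lemma is a direct consequence
of Lemma B.4 and the well-known property of super-critical GW processes
$\mathbf{1}_{\{|\mathbb{G}^*_{j,n}| > 0\}}\, |\mathbb{T}^*_{j,n}|^{-1} \sum_{\ell=1}^{n} Z^i_{j,\ell} \to z^i\,\mathbf{1}_{\overline{\mathcal{E}}_j}$, for all $1 \leq j \leq m$. □
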

Proof of Theorem 4.1, first part We give the details of the convergence of $\hat p^{(1)}_n(1,1)$ to $p^{(1)}(1,1)$, the
other convergences are derived similarly. The proof relies on the convergence of square integrable
scalar martingales. Set

$$M_n = \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+1)} \big(\delta_{(j,4k+2)}\delta_{(j,4k+3)} - p^{(1)}(1,1)\big).$$

We are going to prove that $(M_n)$ is a martingale for a well chosen filtration. Recall that
$\mathcal{O}_{j,n} = \sigma\{\delta_{(j,k)},\ k \in \mathbb{T}_n\}$, and set $\mathcal{O}_n = \vee_{j=1}^{m} \mathcal{O}_{j,n}$. Then $(M_n)$ is clearly a square integrable real
$(\mathcal{O}_n)$-martingale. Using the independence assumption (H.5), its increasing process is

$$\langle M \rangle_n = \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+1)}\, p^{(1)}(1,1)\big(1 - p^{(1)}(1,1)\big) = p^{(1)}(1,1)\big(1 - p^{(1)}(1,1)\big) \sum_{j=1}^{m} \sum_{\ell=1}^{n-1} Z^1_{j,\ell}.$$

Hence, Lemma B.5 implies that $|\mathbb{T}^*_{n-1}|^{-1} \langle M \rangle_n$ converges almost surely on the non-extinction
set $\overline{\mathcal{E}}$. The law of large numbers for scalar martingales thus yields that $|\mathbb{T}^*_{n-1}|^{-1} M_n$ tends to 0 as
$n$ tends to infinity on $\overline{\mathcal{E}}$. Finally, notice that

$$\hat p^{(1)}_n(1,1) - p^{(1)}(1,1) = \frac{M_n}{\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+1)}},$$

so that Lemma B.5 again implies the almost sure convergence of $\hat p^{(1)}_n(1,1)$ to $p^{(1)}(1,1)$ on the
non-extinction set $\overline{\mathcal{E}}$. □

As a direct consequence, one obtains the a.s. convergence of $\hat\pi_n$ to $\pi$ on $\overline{\mathcal{E}}$.
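The consistency statement can be explored by simulation. The sketch below simulates the observation indicators of several independent trees under a two-type reproduction law and computes the empirical frequencies of each daughter configuration among observed mothers of each type. It is an illustration under stated assumptions, not the authors' implementation: the reproduction probabilities are hypothetical, the root is treated as a type-1 cell because its index is odd, and all names are introduced for the example.

```python
import random

OUTCOMES = [(1, 1), (1, 0), (0, 1), (0, 0)]

def simulate_tree(p0, p1, n, rng):
    """Simulate which cells are observed in one tree up to generation n.

    Returns a dict mapping the index of each observed cell to 1.
    """
    delta = {1: 1}                              # the root is always observed
    for k in range(1, 2 ** n):                  # potential mothers in T_{n-1}
        if k in delta:
            p = p0 if k % 2 == 0 else p1
            l0, l1 = rng.choices(OUTCOMES, weights=[p[o] for o in OUTCOMES])[0]
            if l0:
                delta[2 * k] = 1
            if l1:
                delta[2 * k + 1] = 1
    return delta

def estimate(trees, n):
    """Empirical frequency of each daughter configuration, per mother type."""
    counts = {i: {o: 0 for o in OUTCOMES} for i in (0, 1)}
    for delta in trees:
        for c in range(2, 2 ** n):              # mothers 2k+i with k in T_{n-2}
            if c in delta:
                outcome = (delta.get(2 * c, 0), delta.get(2 * c + 1, 0))
                counts[c % 2][outcome] += 1
    return {i: {o: counts[i][o] / max(1, sum(counts[i].values()))
                for o in OUTCOMES} for i in (0, 1)}

# Hypothetical reproduction probabilities (illustrative only).
p0 = {(1, 1): 0.8, (1, 0): 0.1, (0, 1): 0.05, (0, 0): 0.05}
p1 = {(1, 1): 0.7, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.1}
rng = random.Random(0)
trees = [simulate_tree(p0, p1, 12, rng) for _ in range(20)]
p_hat = estimate(trees, 12)
```

Pooling the counts over all trees before dividing, rather than averaging per-tree estimates, mirrors the joint use of all trees advocated in the text: trees that die out early still contribute their few observed mothers.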

B.3 Asymptotic normality for the estimators of the GW process 

As $P(\overline{\mathcal{E}}) \neq 0$, we can define a new probability $P_{\overline{\mathcal{E}}}$ by $P_{\overline{\mathcal{E}}}(A) = P(A \cap \overline{\mathcal{E}})/P(\overline{\mathcal{E}})$ for all events $A$. In
the sequel of this section, we will work on the space $\overline{\mathcal{E}}$ under the probability $P_{\overline{\mathcal{E}}}$ and we denote
by $E_{\overline{\mathcal{E}}}$ the corresponding expectation. We can now turn to the proof of the asymptotic normality
of $\hat p_n$. The proof also relies on martingale theory. As the normalizing term in our central limit
theorem is random, we use the central limit theorem for martingales given in Theorem 2.1.9 of
Duflo (1997) that we first recall as Theorem B.6 for self-completeness.



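Theorem B.6 Suppose that $(\Omega, \mathcal{A}, P)$ is a probability space and that for each $n$ we have a filtration
$\mathbb{F}_n = (\mathcal{F}^{(n)}_k)_{k \geq 0}$, a stopping time $\nu_n$ relative to $\mathbb{F}_n$ and a real, square-integrable vector martingale
$M^{(n)} = (M^{(n)}_k)_{k \geq 0}$ which is adapted to $\mathbb{F}_n$ and has hook denoted by $\langle M \rangle^{(n)}$. We make the
following two assumptions.

A.1 For a deterministic symmetric positive semi-definite matrix $\Gamma$,

$$\langle M \rangle^{(n)}_{\nu_n} \xrightarrow{P} \Gamma.$$

A.2 Lindeberg's condition holds; in other words, for all $\epsilon > 0$,

$$\sum_{k=1}^{\nu_n} E\Big[\|M^{(n)}_k - M^{(n)}_{k-1}\|^2\, \mathbf{1}_{\{\|M^{(n)}_k - M^{(n)}_{k-1}\| > \epsilon\}} \,\Big|\, \mathcal{F}^{(n)}_{k-1}\Big] \xrightarrow{P} 0.$$

Then:

$$M^{(n)}_{\nu_n} \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Gamma).$$
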

Proof of Theorem 4.2, first part First, set

$$V = \begin{pmatrix} V^0/z^0 & 0 \\ 0 & V^1/z^1 \end{pmatrix}, \qquad (6)$$

where for all $i$ in $\{0, 1\}$, $V^i = W^i - p^{(i)}(p^{(i)})^t$, $W^i$ is a $4 \times 4$ matrix with the entries of $p^{(i)}$ on
the diagonal and 0 elsewhere. We are going to prove that $V$ is the asymptotic variance of $\hat p_n - p$
suitably normalized. We use Theorem B.6. We first need to define a suitable filtration. Here, we
use the first cousins filtration defined as follows. Let

$$\mathcal{H}_{j,p} = \sigma\{\delta_{(j,1)}, \ldots, \delta_{(j,3)},\ (\delta_{(j,4k)}, \ldots, \delta_{(j,4k+3)}),\ 1 \leq k \leq p\}$$

be the $\sigma$-field generated by all the 4-tuples of observed cousin cells up to the granddaughters of
cell $(j,p)$ in the $j$-th tree and $\mathcal{H}_p = \vee_{j=1}^{m} \mathcal{H}_{j,p}$. Hence, the 4-tuple $(\delta_{(j,4k)}, \ldots, \delta_{(j,4k+3)})$ is
$\mathcal{H}_k$-measurable for all $j$. By definition of the reproduction probabilities $p^{(i)}(l_0, l_1)$, the components
of the centered vectors $D_{(j,k)}$ defined below
are $(\mathcal{H}_k)$-martingale difference sequences. We thus introduce a sequence of $(\mathcal{H}_k)$-martingales
$(M^{(n)}_p)_{p \geq 1}$ defined for all $n \geq 1$ and $p \geq 1$ by

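$$M^{(n)}_p = |\mathbb{T}^*_{n-1}|^{-1/2} \sum_{k=1}^{p} \sum_{j=1}^{m} D_{(j,k)},$$

with $D_{(j,k)} = ((D^0_{(j,k)})^t, (D^1_{(j,k)})^t)^t$ and

$$D^i_{(j,k)} = \delta_{(j,2k+i)} \begin{pmatrix} \delta_{(j,2(2k+i))}\,\delta_{(j,2(2k+i)+1)} - p^{(i)}(1,1) \\ \delta_{(j,2(2k+i))}\,(1 - \delta_{(j,2(2k+i)+1)}) - p^{(i)}(1,0) \\ (1 - \delta_{(j,2(2k+i))})\,\delta_{(j,2(2k+i)+1)} - p^{(i)}(0,1) \\ (1 - \delta_{(j,2(2k+i))})\,(1 - \delta_{(j,2(2k+i)+1)}) - p^{(i)}(0,0) \end{pmatrix}.$$

We also introduce the sequence of stopping times $\nu_n = |\mathbb{T}_{n-2}| = 2^{n-1} - 1$. One has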





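$$E_{\overline{\mathcal{E}}}\big[D_{(j,k)} D^t_{(j,k)} \,\big|\, \mathcal{H}_{k-1}\big] = \begin{pmatrix} \delta_{(j,2k)} V^0 & 0 \\ 0 & \delta_{(j,2k+1)} V^1 \end{pmatrix}.$$

Therefore one has

$$\langle M^{(n)} \rangle_{\nu_n} = |\mathbb{T}^*_{n-1}|^{-1} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \begin{pmatrix} \delta_{(j,2k)} V^0 & 0 \\ 0 & \delta_{(j,2k+1)} V^1 \end{pmatrix},$$

so that its almost sure limit is

$$\Gamma = \begin{pmatrix} z^0 V^0 & 0 \\ 0 & z^1 V^1 \end{pmatrix}$$

thanks to Lemma B.5. Therefore, assumption A.1 of Theorem B.6 holds under $P_{\overline{\mathcal{E}}}$. The Lindeberg
condition A.2 is obviously satisfied as we deal with finite support distributions. We then conclude
that under $P_{\overline{\mathcal{E}}}$ one has

$$|\mathbb{T}^*_{n-1}|^{-1/2} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} D_{(j,k)} \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Gamma).$$

Using the relation

$$\hat p_n - p = \begin{pmatrix} \big(\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k)}\big)^{-1} I_4 & 0 \\ 0 & \big(\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+1)}\big)^{-1} I_4 \end{pmatrix} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} D_{(j,k)},$$

Lemma B.5 and Slutsky's Lemma give the first part of Theorem 4.2.

□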



B.4 Interval estimation and tests for the GW process 



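From the central limit Theorem 4.2 one can easily build asymptotic confidence intervals for our
estimators. In our context, $Y_n$ and $Y'_n$ being two random variables, we will say that $[Y_n; Y'_n]$ is an
asymptotic confidence interval with confidence level $1 - \epsilon$ for the parameter $Y$ if
$\lim_{n\to\infty} P_{\overline{\mathcal{E}}}(Y_n \leq Y \leq Y'_n) \geq (1 - \epsilon)$. For any $0 < \epsilon < 1$, let $q_{1-\epsilon/2}$ be the $1 - \epsilon/2$ quantile of the standard normal
law.

For all $n \geq 2$, define the $8 \times 8$ matrix

$$\hat V_n = \begin{pmatrix} \hat V^0_n \big(\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k)}\big)^{-1} & 0 \\ 0 & \hat V^1_n \big(\sum_{j=1}^{m} \sum_{k \in \mathbb{T}_{n-2}} \delta_{(j,2k+1)}\big)^{-1} \end{pmatrix},$$

where for all $i$ in $\{0, 1\}$, $\hat V^i_n = \hat W^i_n - \hat p^{(i)}_n (\hat p^{(i)}_n)^t$, $\hat W^i_n$ is a $4 \times 4$ matrix with the entries of $\hat p^{(i)}_n$ on
the diagonal and 0 elsewhere. Thus, $|\mathbb{T}^*_{n-1}| \hat V_n$ is an empirical estimator of the covariance matrix
$V$.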
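
Theorem B.7 Under assumptions (H.5-6), for $i, l_0, l_1$ in $\{0, 1\}$ and for any $0 < \epsilon < 1$, the random
interval defined by

$$\Big[\hat p^{(i)}_n(l_0, l_1) - q_{1-\epsilon/2} (\hat V_n)^{1/2}_{\ell,\ell}\ ;\ \hat p^{(i)}_n(l_0, l_1) + q_{1-\epsilon/2} (\hat V_n)^{1/2}_{\ell,\ell}\Big]$$

is an asymptotic confidence interval with level $1 - \epsilon$ for $p^{(i)}(l_0, l_1)$, where $(\ell, \ell)$ is the coordinate of
$\hat V_n$ corresponding to $p^{(i)}(l_0, l_1)$, namely $\ell = 4(i + 1) - (2 l_0 + l_1)$.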

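Proof This is a straightforward consequence of the central limit Theorem 4.2 together with
Slutsky's lemma as $\lim_{n\to\infty} |\mathbb{T}^*_{n-1}| \hat V_n = V$ a.s. thanks to Lemma B.5 and Theorem 4.1. □

Set $\hat G_n = F_n^t \hat V_n F_n$, where $F_n$ is the $8 \times 1$ vector defined by

$$F_n = \frac{1}{2}(1, 1, 0, 0, 1, 0, 1, 0)^t + \frac{1}{2}(\hat T_n^2 - 4\hat D_n)^{-1/2} H_n$$

and

$$H_n = \begin{pmatrix} \hat p^{(0)}_n(1,1) + \hat p^{(0)}_n(1,0) + \hat p^{(1)}_n(1,1) + 2\hat p^{(1)}_n(1,0) - \hat p^{(1)}_n(0,1) \\ \hat p^{(0)}_n(1,1) + \hat p^{(0)}_n(1,0) - \hat p^{(1)}_n(1,1) - \hat p^{(1)}_n(0,1) \\ 2\hat p^{(1)}_n(1,1) + 2\hat p^{(1)}_n(1,0) \\ 0 \\ \hat p^{(0)}_n(1,1) - \hat p^{(0)}_n(1,0) + 2\hat p^{(0)}_n(0,1) + \hat p^{(1)}_n(1,1) + \hat p^{(1)}_n(0,1) \\ 2\hat p^{(0)}_n(1,1) + 2\hat p^{(0)}_n(0,1) \\ -\hat p^{(0)}_n(1,1) - \hat p^{(0)}_n(1,0) + \hat p^{(1)}_n(1,1) + \hat p^{(1)}_n(0,1) \\ 0 \end{pmatrix}.$$
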



Theorem B.8 Under assumptions (H.5-6), for any $0 < \epsilon < 1$ one has that

$$\Big[\hat\pi_n - q_{1-\epsilon/2}\, \hat G_n^{1/2}\ ;\ \hat\pi_n + q_{1-\epsilon/2}\, \hat G_n^{1/2}\Big]$$

is an asymptotic confidence interval with level $1 - \epsilon$ for $\pi$.

Proof This is again a straightforward consequence of the central limit Theorem 4.2 together with
Slutsky's lemma as $F_n$ is the gradient of the function that maps the vector $\hat p_n$ onto the estimator
$\hat\pi_n$. □
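The gradient formula behind this delta-method argument can be checked numerically. The sketch below evaluates the closed-form expression of $\pi$ as a function of the eight reproduction probabilities, codes the gradient as one half of $(1,1,0,0,1,0,1,0)^t$ plus a correction term, and compares it with central finite differences. The ordering of the coordinates, the function names and the test point are assumptions of this example.

```python
import math

def pi_of(x):
    """pi as a function of
    x = (p0(1,1), p0(1,0), p0(0,1), p0(0,0), p1(1,1), p1(1,0), p1(0,1), p1(0,0)).
    """
    a, b, c, _, e, f, g, _ = x
    t = a + b + e + g                                  # trace of P
    d = (a + b) * (e + g) - (c + a) * (f + e)          # determinant of P
    return 0.5 * (t + math.sqrt(t * t - 4.0 * d))

def gradient_F(x):
    """Closed-form gradient of pi_of at x."""
    a, b, c, _, e, f, g, _ = x
    t = a + b + e + g
    d = (a + b) * (e + g) - (c + a) * (f + e)
    h = [a + b + e + 2 * f - g, a + b - e - g, 2 * e + 2 * f, 0.0,
         a - b + 2 * c + e + g, 2 * a + 2 * c, -a - b + e + g, 0.0]
    base = [1, 1, 0, 0, 1, 0, 1, 0]
    s = (t * t - 4.0 * d) ** -0.5
    return [0.5 * u + 0.5 * s * v for u, v in zip(base, h)]

def numerical_gradient(x, eps=1e-6):
    """Central finite differences of pi_of, coordinate by coordinate."""
    grad = []
    for i in range(len(x)):
        up = list(x); up[i] += eps
        dn = list(x); dn[i] -= eps
        grad.append((pi_of(up) - pi_of(dn)) / (2 * eps))
    return grad

# Hypothetical evaluation point (illustrative only).
x = (0.8, 0.1, 0.05, 0.05, 0.7, 0.1, 0.1, 0.1)
F = gradient_F(x)
num = numerical_gradient(x)
```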

We propose two symmetry tests for the GW process. The first one compares the average number
of offspring $m_0$ of a cell of type 0: $m_0 = p^{(0)}(1,0) + p^{(0)}(0,1) + 2p^{(0)}(1,1)$ to that of a cell of type
1: $m_1 = p^{(1)}(1,0) + p^{(1)}(0,1) + 2p^{(1)}(1,1)$. Denote by $\hat m^0_n$ and $\hat m^1_n$ their empirical estimators. Set

• $H^m_0$: $m_0 = m_1$ the symmetry hypothesis,

• $H^m_1$: $m_0 \neq m_1$ the alternative hypothesis.

Let $(Y^m_n)^2$ be the test statistic defined by

$$Y^m_n = |\mathbb{T}^*_{n-1}|^{1/2} (\hat\Delta^m_n)^{-1/2} (\hat m^0_n - \hat m^1_n),$$

where $\hat\Delta^m_n = |\mathbb{T}^*_{n-1}|\, dg_m^t \hat V_n\, dg_m$ and $dg_m = (2, 1, 1, 0, -2, -1, -1, 0)^t$. This test statistic has the
following asymptotic properties.

Theorem B.9 Under assumptions (H.5-6) and the null hypothesis $H^m_0$, one has

$$(Y^m_n)^2 \xrightarrow{\mathcal{L}} \chi^2(1)$$

on $(\overline{\mathcal{E}}, P_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H^m_1$, almost surely on $\overline{\mathcal{E}}$ one has

$$\lim_{n\to\infty} (Y^m_n)^2 = +\infty.$$
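
Proof Let $g_m$ be the function defined from $\mathbb{R}^8$ onto $\mathbb{R}$ by $g_m(x_1, \ldots, x_8) = 2x_1 + x_2 + x_3 - 2x_5 - x_6 - x_7$
so that $dg_m$ is the gradient of $g_m$. Thus, the central limit Theorem 4.2 yields

$$|\mathbb{T}^*_{n-1}|^{1/2}\big(g_m(\hat p_n) - g_m(p)\big) \xrightarrow{\mathcal{L}} \mathcal{N}(0, dg_m^t V dg_m) = \mathcal{N}(0, \Delta^m)$$

on $(\overline{\mathcal{E}}, P_{\overline{\mathcal{E}}})$. Under the null hypothesis $H^m_0$, $g_m(p) = 0$, so that one has

$$|\mathbb{T}^*_{n-1}|^{1/2}(\hat m^0_n - \hat m^1_n) \xrightarrow{\mathcal{L}} \mathcal{N}(0, \Delta^m)$$

on $(\overline{\mathcal{E}}, P_{\overline{\mathcal{E}}})$. Lemma B.5 and Theorem 4.1 give the almost sure convergence of $\hat\Delta^m_n$ to $\Delta^m$. Hence
Slutsky's Lemma yields the expected result. Under the alternative hypothesis $H^m_1$, one has

$$Y^m_n = (\hat\Delta^m_n)^{-1/2}\Big(|\mathbb{T}^*_{n-1}|^{1/2}\big(g_m(\hat p_n) - g_m(p)\big) + |\mathbb{T}^*_{n-1}|^{1/2}\, g_m(p)\Big).$$

The first term converges to a centered normal law and the second term tends to infinity as $|\mathbb{T}^*_{n-1}|$
tends to infinity a.s. on $(\overline{\mathcal{E}}, P_{\overline{\mathcal{E}}})$. □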
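A minimal sketch of how such a test statistic might be computed is given below. It assumes that the estimated covariance is block-diagonal with blocks $\hat V^i_n$ divided by the number of observed mothers of type $i$, in which case the quadratic form $dg_m^t \hat V_n\, dg_m$ reduces to the sum of the two estimated offspring-number variances divided by the respective counts. The function names, the input values and this reduction are assumptions of the example rather than the paper's implementation.

```python
import math

# Number of observed daughters for each configuration (l0, l1).
OFFSPRING = {(1, 1): 2, (1, 0): 1, (0, 1): 1, (0, 0): 0}

def mean_and_variance(p):
    """Mean and variance of the number of daughters under the law p."""
    mean = sum(OFFSPRING[o] * p[o] for o in p)
    second = sum(OFFSPRING[o] ** 2 * p[o] for o in p)
    return mean, second - mean ** 2

def symmetry_test_means(p0_hat, p1_hat, n0, n1):
    """Return (statistic, p_value) for the hypothesis m0 = m1.

    n0 and n1 are the numbers of observed mothers of type 0 and 1.
    The p-value uses the chi-square law with one degree of freedom,
    whose survival function at y equals erfc(sqrt(y / 2)).
    """
    m0, v0 = mean_and_variance(p0_hat)
    m1, v1 = mean_and_variance(p1_hat)
    stat = (m0 - m1) ** 2 / (v0 / n0 + v1 / n1)
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical estimates and counts (illustrative only).
p0_hat = {(1, 1): 0.8, (1, 0): 0.1, (0, 1): 0.05, (0, 0): 0.05}
p1_hat = {(1, 1): 0.7, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.1}
stat, p_value = symmetry_test_means(p0_hat, p1_hat, 5000, 5000)
```

The symmetry hypothesis would be rejected at level 5% when `p_value` falls below 0.05.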

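Our next test compares the reproduction probability vectors of mother cells of type 0 and 1.

• $H^p_0$: $p^{(0)} = p^{(1)}$ the symmetry hypothesis.

• $H^p_1$: $p^{(0)} \neq p^{(1)}$ the alternative hypothesis.

Let $(Y^p_n)^t Y^p_n$ be the test statistic defined by

$$Y^p_n = |\mathbb{T}^*_{n-1}|^{1/2} (\hat\Delta^p_n)^{-1/2} (\hat p^{(0)}_n - \hat p^{(1)}_n),$$

where $\hat\Delta^p_n = |\mathbb{T}^*_{n-1}|\, dg_p^t \hat V_n\, dg_p$ and $dg_p = \begin{pmatrix} I_4 \\ -I_4 \end{pmatrix}$. This test statistic has the following
asymptotic properties.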

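Theorem B.10 Under assumptions (H.5-6) and the null hypothesis $H^p_0$, one has

on $(\overline{\mathcal{E}}, P_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H^p_1$, almost surely on $\overline{\mathcal{E}}$ one has

$$\lim_{n\to\infty} \|Y^p_n\|^2 = +\infty.$$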



Proof We mimic the proof of Theorem B.9 with $g_p$ the function defined from $\mathbb{R}^8$ onto $\mathbb{R}^4$ by
$g_p(x_1, \ldots, x_8) = (x_1 - x_5, x_2 - x_6, x_3 - x_7, x_4 - x_8)^t$, so that $dg_p$ is the gradient of $g_p$. □



C Convergence of estimators for the BAR process 

We now prove the convergence of the estimators for the BAR process, that is the parts of Theorems
4.1 and 4.2 concerning $\hat\theta_n$, $\hat\sigma^2_{i,n}$ and $\hat\rho_n$, together with additional technical results, especially
the convergence of higher moment estimators required to estimate the asymptotic variances.

C.1 Preliminary results: laws of large numbers

In this section, we want to study the asymptotic behavior of various sums of observed data. Most
of the results are directly taken from de Saporta et al (2011). All external references in this section
refer to that paper, which will not be cited each time. However, we need additional results concerning
higher moments of the BAR process in order to obtain the consistency of $\hat\tau^4_{i,n}$ and $\hat\nu^2_n$, as there is no
such result in de Saporta et al (2011). We also give all the explicit formulas so that the interested
reader can actually compute the various asymptotic variances.

Again, our work relies on the strong law of large numbers for square integrable martingales.
To ensure that the increasing processes of our martingales are at most $O(\pi^n)$ we first need the
following lemma.

Lemma C.1 Under assumptions (H.0-6), for all $i \in \{0, 1\}$ one has



Proof The proof follows the same lines as that of Lemma 6.1. The constants before the terms A^^, 
Bl^ and therein are replaced respectively by (4/(1 — f3)y, a*(4/(l — /3))^ and 2^; in the term 
Al^, is replaced by in the term C^, ji'^'^'' is replaced by the term is unchanged. In 

the expression of E[(y/p)-^], one just needs to replace by /ij^, cr^ by 7^!'^^ and v^t^ by 77^. Note 
that the various moments of the noise sequence are defined in assumption (H.l). The rest of the 
proof is unchanged. □ 

We also state some laws of large numbers for the noise processes.

Lemma C.2 Under assumptions (H.0-6), for all $i \in \{0, 1\}$ and for all integers $0 \leq q \leq 4$, one has



TT 



TT 



Proof This is also a direct consequence of de Saporta et al (2011) thanks to Lemmas B.3 and B.4.
Lemma 5.3 provides the result for $q = 0$, Lemma 5.5 for $q = 1$, Corollary 5.6 for $q = 2$ and Lemma
5.7 for $q = 4$. The result for $q = 3$ is obtained similarly. □

In view of these new stronger results, we can now state our first laws of large numbers for the
observed BAR process. For $i \in \{0, 1\}$ and all integers $1 \leq q \leq 4$ let us now define

in m 
m 7n 

j=i j=i fceT„ 

and H„(g) = (H0(g),i7i(<z))*. 

Lemma C.3 Under assumptions (H.0-6) and for all integers $1 \leq q \leq 4$, one has the following a.s.
limits on the non-extinction set $\overline{\mathcal{E}}$



lim l{|G.|>o}|T:j-iH„(g) = h(9) = (l2-P«)-^P'h(g), 



lim 1 

n— >-oo 



{|G 



+pW(l,l)(/ii(g) + 6?^) 



whe 



and /or i G {0, 1} 



/i*(l) = a,z\ 

Proof The results for $q = 1$ and $q = 2$ come from Propositions 6.3, 6.5 and 6.6 together with
Lemma B.4. The proofs for $q \geq 3$ follow the same lines, using Lemma C.2 when required and
Lemma C.1 to bound the increasing processes of the various martingales at stake. □

To prove the consistency of our estimators, we also need some additional families of laws of 
large numbers. 

Lemma C.4 Under assumptions (H.0-6), for $i \in \{0, 1\}$ and for all integers $1 \leq p + q \leq 4$, one
has the following a.s. limits



1 



{|g;|>o} 



j=i feeT„ 

Proof The proof is similar to that of Theorem 4.1. For all $1 \leq j \leq m$, one has

fceT„ 

n 

e=o fceGf fceT„ 

as the conditional moments of $\varepsilon_{(j,2k+i)}$ are constants by assumption (H.1). The first term is a square
integrable $(\mathcal{F}_{j,n})$-martingale and its increasing process is $O(\pi^n)$ thanks to Lemma C.1, thus the
first term is $o(\pi^n)$. The limit of the second term is given by Lemma C.3. □



Lemma C.5 Under assumptions (H.0-6), for $i \in \{0, 1\}$ and for all integers $1 \leq q \leq 4$, one has
the following a.s. limits

rn 

l{|G;|>0}inr'E E h,.2k+.)Xl,,^^ = {nh\q)+b!h\q))lj. 

j=i feeT„ 

Proof The proof is obtained by replacing $X_{(j,2k+i)}$ by $a_i + b_i X_{(j,k)} + \varepsilon_{(j,2k+i)}$. One then develops the
exponent and uses Lemmas B.5, C.2, C.3 and C.4 to conclude. □

Lemma C.6 Under assumptions (H.0-6), for $i \in \{0, 1\}$ and for all integers $1 \leq p + q \leq 4$, one
has the following a.s. limits

m 

j=i fceT„ 

with 

$$h^i(p,1) = a_i h^i(p) + b_i h^i(p+1),$$

$$h^i(p,2) = (a_i^2 + \sigma_i^2) h^i(p) + 2 a_i b_i h^i(p+1) + b_i^2 h^i(p+2),$$

$$h^i(p,3) = (a_i^3 + 3 a_i \sigma_i^2 + \lambda_i) h^i(p) + 3 b_i (a_i^2 + \sigma_i^2) h^i(p+1) + 3 a_i b_i^2 h^i(p+2) + b_i^3 h^i(p+3),$$

where we used the convention $h^i(0) = z^i \pi$.

Proof As above, the proof is obtained by replacing $X_{(j,2k+i)}$ and developing the exponents. Then
one uses Lemmas B.5, C.2, C.3 and C.4 to compute the limits. □
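The coefficients appearing in $h^i(p,3)$ come from expanding $(a_i + b_i x + \varepsilon)^3$ and taking the expectation over a centered noise with variance $\sigma_i^2$ and third moment $\lambda_i$. The sketch below verifies this expansion exactly with rational arithmetic on a toy two-point noise distribution; the distribution, the parameter values and the function names are assumptions made for the example.

```python
from fractions import Fraction as Fr

def noise_moments(values, probs):
    """Exact variance and third moment of a centered discrete noise."""
    mean = sum(v * p for v, p in zip(values, probs))
    if mean != 0:
        raise ValueError("the noise must be centered")
    sigma2 = sum(v ** 2 * p for v, p in zip(values, probs))
    lam = sum(v ** 3 * p for v, p in zip(values, probs))
    return sigma2, lam

def third_moment_direct(a, b, x, values, probs):
    """E[(a + b x + eps)^3] computed by direct summation over the noise."""
    return sum((a + b * x + v) ** 3 * p for v, p in zip(values, probs))

def third_moment_formula(a, b, x, sigma2, lam):
    """Same expectation via the coefficients of h(p), ..., h(p+3)."""
    return ((a ** 3 + 3 * a * sigma2 + lam)
            + 3 * b * (a ** 2 + sigma2) * x
            + 3 * a * b ** 2 * x ** 2
            + b ** 3 * x ** 3)

# Toy centered noise: -1 with probability 2/3, +2 with probability 1/3.
values = [Fr(-1), Fr(2)]
probs = [Fr(2, 3), Fr(1, 3)]
sigma2, lam = noise_moments(values, probs)
a, b, x = Fr(1, 2), Fr(3, 10), Fr(7, 4)
direct = third_moment_direct(a, b, x, values, probs)
formula = third_moment_formula(a, b, x, sigma2, lam)
```

Each power $x^r$ in the expansion is what becomes the term $h^i(p + r)$ once multiplied by $X^p_{(j,k)}$ and averaged over the observed cells.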

Lemma C.7 Under assumptions (H.5-6), one has the following a.s. limit

$$\lim_{n\to\infty} \mathbf{1}_{\{|\mathbb{G}^*_n| > 0\}}\, |\mathbb{T}^*_n|^{-1} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} \delta_{(j,2(2k+i))}\,\delta_{(j,2(2k+i)+1)} = p^{(i)}(1,1)\, z^i \pi\, \mathbf{1}_{\overline{\mathcal{E}}}.$$

Proof First note that $\delta_{(j,2(2k+i))}\delta_{(j,2(2k+i)+1)} = \delta_{(j,2k+i)}\delta_{(j,2(2k+i))}\delta_{(j,2(2k+i)+1)}$. The proof is then
similar to that of Theorem 4.1. One adds and subtracts $p^{(i)}(1,1)$ so that a martingale similar to
$(M_n)$ naturally appears. The limit of the remaining term is given by Lemma B.5. □



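Lemma C.8 Under assumptions (H.0-6), for all integers $0 \leq p + q + r \leq 4$, one has the following
a.s. limits

$$\lim_{n\to\infty} \mathbf{1}_{\{|\mathbb{G}^*_n| > 0\}}\, |\mathbb{T}^*_n|^{-1} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} \delta_{(j,2k)}\,\delta_{(j,2k+1)}\, X^p_{(j,k)}\, \varepsilon^q_{(j,2k)}\, \varepsilon^r_{(j,2k+1)} = E[\varepsilon_2^q \varepsilon_3^r]\, h^{01}(p)\, \mathbf{1}_{\overline{\mathcal{E}}},$$

where we used the convention $h^{01}(0) = p^{(0)}(1,1) z^0 + p^{(1)}(1,1) z^1$.

Proof The proof is similar to that of Lemma C.4: one adds and subtracts the constant $E[\varepsilon^q_{(j,2k)} \varepsilon^r_{(j,2k+1)} \mid \mathcal{F}_{j,\ell}]$.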

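Lemma C.9 Under assumptions (H.0-6), for all integers $1 \leq p + q + r \leq 4$, one has the following
a.s. limits

$$\lim_{n\to\infty} \mathbf{1}_{\{|\mathbb{G}^*_n| > 0\}}\, |\mathbb{T}^*_n|^{-1} \sum_{j=1}^{m} \sum_{k \in \mathbb{T}_n} \delta_{(j,2k)}\,\delta_{(j,2k+1)}\, X^p_{(j,k)}\, X^q_{(j,2k)}\, X^r_{(j,2k+1)} = h^{01}(p,q,r)\, \mathbf{1}_{\overline{\mathcal{E}}},$$

with

$$h^{01}(p,1,0) = a_0 h^{01}(p) + b_0 h^{01}(p+1), \qquad h^{01}(p,0,1) = a_1 h^{01}(p) + b_1 h^{01}(p+1),$$

$$h^{01}(p,2,0) = (a_0^2 + \sigma_0^2) h^{01}(p) + 2 a_0 b_0 h^{01}(p+1) + b_0^2 h^{01}(p+2),$$

$$h^{01}(p,0,2) = (a_1^2 + \sigma_1^2) h^{01}(p) + 2 a_1 b_1 h^{01}(p+1) + b_1^2 h^{01}(p+2),$$

$$h^{01}(p,3,0) = (a_0^3 + 3 a_0 \sigma_0^2 + \lambda_0) h^{01}(p) + 3 b_0 (a_0^2 + \sigma_0^2) h^{01}(p+1) + 3 a_0 b_0^2 h^{01}(p+2) + b_0^3 h^{01}(p+3),$$

$$h^{01}(p,0,3) = (a_1^3 + 3 a_1 \sigma_1^2 + \lambda_1) h^{01}(p) + 3 b_1 (a_1^2 + \sigma_1^2) h^{01}(p+1) + 3 a_1 b_1^2 h^{01}(p+2) + b_1^3 h^{01}(p+3),$$

$$h^{01}(p,1,1) = (a_0 a_1 + \rho) h^{01}(p) + (a_0 b_1 + b_0 a_1) h^{01}(p+1) + b_0 b_1 h^{01}(p+2),$$

$$h^{01}(p,2,1) = \big((a_0^2 + \sigma_0^2) a_1 + 2 a_0 \rho + \alpha\big) h^{01}(p) + \big((a_0^2 + \sigma_0^2) b_1 + 2 (a_0 a_1 + \rho) b_0\big) h^{01}(p+1) + b_0 (2 a_0 b_1 + b_0 a_1) h^{01}(p+2) + b_0^2 b_1 h^{01}(p+3),$$

$$h^{01}(p,1,2) = \big((a_1^2 + \sigma_1^2) a_0 + 2 a_1 \rho + \beta\big) h^{01}(p) + \big((a_1^2 + \sigma_1^2) b_0 + 2 (a_0 a_1 + \rho) b_1\big) h^{01}(p+1) + b_1 (a_0 b_1 + 2 b_0 a_1) h^{01}(p+2) + b_0 b_1^2 h^{01}(p+3),$$

$$h^{01}(0,2,2) = (a_0^2 a_1^2 + a_0^2 \sigma_1^2 + a_1^2 \sigma_0^2 + \nu^2 + 2 a_0 \beta + 2 a_1 \alpha + 4 a_0 a_1 \rho) h^{01}(0)$$
$$\quad + 2\big(b_0 (a_0 (a_1^2 + \sigma_1^2) + \beta + 2 a_1 \rho) + b_1 (a_1 (a_0^2 + \sigma_0^2) + \alpha + 2 a_0 \rho)\big) h^{01}(1)$$
$$\quad + \big(b_0^2 (a_1^2 + \sigma_1^2) + b_1^2 (a_0^2 + \sigma_0^2) + 4 b_0 b_1 (a_0 a_1 + \rho)\big) h^{01}(2)$$
$$\quad + 2 b_0 b_1 (a_0 b_1 + b_0 a_1) h^{01}(3) + b_0^2 b_1^2 h^{01}(4).$$

Proof The proof is obtained by replacing $X_{(j,2k+i)}$ by $a_i + b_i X_{(j,k)} + \varepsilon_{(j,2k+i)}$ and developing the
exponents. One uses Lemmas C.3 and C.8 to compute the limits. □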

To conclude this section, we prove the convergence of the normalizing matrices $S^i_n$ and $S^{01}_n$
where

with the sum taken over all observed cells that have observed daughters of both types. 

Lemma C.10 Suppose that assumptions (H.0-6) are satisfied. Then, there exist positive definite
matrices $L^0$, $L^1$ and $L^{01}$ such that for $i \in \{0, 1\}$ one has



n—>oo 

where 

V = 



hm l{|G.|>o}|T;riS:, = ljL\ lim |>o} im-iSS'/ = %L 



"1 a.s. 



Proof This is a direct consequence of Lemmas B.5 and C.3. □

C.2 Strong consistency for the estimators of the BAR process

We could obtain the convergences of our estimators by sharp martingale results as in de Saporta
et al (2011), see also B.2. However, we chose the direct approach here. Indeed, our convergences
are now direct consequences of the laws of large numbers given in C.1.

Proof of Theorem 4.1, convergence of $\hat\theta_n$ This is a direct consequence of Lemmas C.10 and C.6.
Indeed, by Lemma C.6 one has




HIg;,iI>o} 
IT* J 



4|G*-il>0} 

it: 



n-l I 



E E 



3=1 fceT„ 

And one concludes using Lemma C.10.



( %.2fe)^(i,2fc) 



LO 
Li 



□ 



Proof of Theorem 4.1, convergence of $\hat\sigma^2_{i,n}$ and $\hat\rho_n$ This result is not as direct as the preceding one
because of the presence of the $\hat\theta_\ell$ in the various estimators. Take for instance the estimator $\hat\sigma^2_{i,n}$.
For all $1 \leq j \leq m$, one has



E= 



U.2k+i) 



keTr, 



E E ^U,^k+i){X[j^2k+i) ~ 0,1,1 — bi/X^^j^k))"^ 

m n— 1 

E ^U,2k+i)Xfj^2k+t) + E E E 



+2 ^ tti^ibi^i ^ %,2fe+i)^0',fc) 

n-l 

~2^ai,£ E ^U,'2k+i)X[jak+i) 



r2 



e=0 keGe 

Let us study the limit of the last term. One has 

n— 1 n—1 



n-l 

■ E E %,2fc+»)^C 

e=o keGt 

n-l 

2^&i,£ ^ S(j^2k+i)X(j^k)X[j^2k+i)- 
e=0 keGi 



;^E^^>^ E \j-2k+i)X(^j^k)X(j,2k+{) - "E \ E ^U,2k+i)X(^j^k)Xlj.2k+i) 



IT 

£=0 keGe 

We now use Lemma 



vr ^ — ' vr'' 

^=0 



keGi 



B.l 



^with A„ = vr " a nd X „ ~ bi^„TT J2keG„ ^U-'2k+i)X(^j^k)X(j^2k+i)- We 
know from Lemma C.6 together with Lemma B.2 that tt^" X^feGG ^ij,2k+i)X(j^k)X(j^2k+i) converges 
to /i'^(l, i)Wj, and the previous proof gives the convergence of 6i^„. Thus, one obtains 

1 vr^ 
^j^E^'^'^ E %,2fc+i)-^0-,fc)^(i,2fe+*) ^j— yWj-6/i°(1,1). 

We deal with the other terms in the decomposition of the sum of the $\hat\varepsilon^2_{(j,2k+i)}$ in a similar way, using either Lemma C.3, C.5 or C.6. Finally, one obtains the almost sure limit on $\overline{\mathcal{E}}$



To obtain the convergence of p„ the approach is similar, using the convergence results given in 
Lemmas laSllOTllaSl and ICJl □ 



Theorem C.11 Under assumptions (H.0-6), $\hat\tau^4_{i,n}$ and $\hat\nu^2_n$ converge almost surely to $\tau^4_i$ and $\nu^2$ respectively on $\overline{\mathcal{E}}$.

Proof We work exactly along the same lines as the previous proof with higher powers. □

C.3 Asymptotic normality for the estimators of the BAR process

We first give the asymptotic normality for $\hat\theta_n$.



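Proof of Theorem 4.2 for $\hat\theta_n$. Define the $4\times4$ matrices

$$\Sigma = \begin{pmatrix}L^0 & 0\\ 0 & L^1\end{pmatrix},\qquad \Gamma = \begin{pmatrix}\sigma_0^2L^0 & \rho L^{0,1}\\ \rho L^{0,1} & \sigma_1^2L^1\end{pmatrix}. \tag{7}$$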



We now follow the same lines as the proof of the first part of Theorem 4.2 with a different filtration. This time we use the observed sister pair-wise filtration defined as follows. For $1 \le j \le m$ and $p \ge 0$, let



(8) 



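be the $\sigma$-field generated by the $j$-th GW tree and all the pairs of observed sister cells in genealogy $j$ up to the daughters of cell $(j,p)$; and let $\mathcal{G}^{\mathcal{O}}_p = \vee_{j=1}^{m}\mathcal{G}^{\mathcal{O}}_{j,p}$ be the $\sigma$-field generated by the union of all $\mathcal{G}^{\mathcal{O}}_{j,p}$ for $1 \le j \le m$. Hence, for instance, $(\delta_{(j,2k)}\varepsilon_{(j,2k)}, \delta_{(j,2k+1)}\varepsilon_{(j,2k+1)})$ is $\mathcal{G}^{\mathcal{O}}_k$-measurable for all $j$. In addition, assumptions (H.1) and (H.4-5) imply that the process

$$\big(\delta_{(j,2k)}\varepsilon_{(j,2k)},\ X_{(j,k)}\delta_{(j,2k)}\varepsilon_{(j,2k)},\ \delta_{(j,2k+1)}\varepsilon_{(j,2k+1)},\ X_{(j,k)}\delta_{(j,2k+1)}\varepsilon_{(j,2k+1)}\big)^t$$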

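is a $(\mathcal{G}^{\mathcal{O}}_k)$-martingale difference sequence. Indeed, as the non-extinction set $\overline{\mathcal{E}}$ is in $\mathcal{G}^{\mathcal{O}}_k$ for every $k \ge 1$, it is first easy to prove that $\mathbb{E}_{\overline{\mathcal{E}}}[\delta_{(j,2k)}\varepsilon_{(j,2k)} \mid \mathcal{G}^{\mathcal{O}}_{k-1}] = \mathbb{E}[\delta_{(j,2k)}\varepsilon_{(j,2k)} \mid \mathcal{G}^{\mathcal{O}}_{k-1}]$. Then, for $k \in \mathbb{G}_n$, using repeatedly the independence properties, one has

\begin{align*}
\mathbb{E}[\delta_{(j,2k)}\varepsilon_{(j,2k)} \mid \mathcal{G}^{\mathcal{O}}_{k-1}]
&= \delta_{(j,2k)}\,\mathbb{E}\big[\mathbb{E}[\varepsilon_{(j,2k)} \mid \mathcal{O}\vee\mathcal{F}_n\vee\sigma(\varepsilon_{(j,p)},\ 1\le j\le m,\ p\in\mathbb{G}_{n+1},\ p\le 2k-1)] \mid \mathcal{G}^{\mathcal{O}}_{k-1}\big]\\
&= \delta_{(j,2k)}\,\mathbb{E}\big[\mathbb{E}[\varepsilon_{(j,2k)} \mid \mathcal{F}_n\vee\sigma(\varepsilon_{(j,p)},\ 1\le j\le m,\ p\in\mathbb{G}_{n+1},\ p\le 2k-1)] \mid \mathcal{G}^{\mathcal{O}}_{k-1}\big]\\
&= \delta_{(j,2k)}\,\mathbb{E}\big[\mathbb{E}[\varepsilon_{(j,2k)} \mid \mathcal{F}_n] \mid \mathcal{G}^{\mathcal{O}}_{k-1}\big] = \delta_{(j,2k)}\,\mathbb{E}\big[\mathbb{E}[\varepsilon_{(j,2k)} \mid \mathcal{F}_{j,n}] \mid \mathcal{G}^{\mathcal{O}}_{k-1}\big] = 0.
\end{align*}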



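We introduce a sequence of $(\mathcal{G}^{\mathcal{O}}_p)$-martingales $(M^{(n)}_p)_{p\ge1}$ defined for all $n, p \ge 1$ by $M^{(n)}_p = |\mathbb{T}^*_n|^{-1/2}\sum_{k=1}^{p}D_k$, with

$$D_k = \sum_{j=1}^{m}\begin{pmatrix}\delta_{(j,2k)}\varepsilon_{(j,2k)}\\ X_{(j,k)}\delta_{(j,2k)}\varepsilon_{(j,2k)}\\ \delta_{(j,2k+1)}\varepsilon_{(j,2k+1)}\\ X_{(j,k)}\delta_{(j,2k+1)}\varepsilon_{(j,2k+1)}\end{pmatrix}.$$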



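We also introduce the sequence of stopping times $\nu_n = |\mathbb{T}_n| = 2^{n+1} - 1$. We are interested in the convergence of the process $M^{(n)}_{\nu_n} = |\mathbb{T}^*_n|^{-1/2}\sum_{k=1}^{\nu_n}D_k$. Again, it is easy to prove that

$$\mathbb{E}_{\overline{\mathcal{E}}}[D_kD_k^t \mid \mathcal{G}^{\mathcal{O}}_{k-1}] = \mathbb{E}[D_kD_k^t \mid \mathcal{G}^{\mathcal{O}}_{k-1}] = \sum_{j=1}^{m}\begin{pmatrix}\sigma_0^2\,\varphi^0_{(j,k)} & \rho\,\varphi^{0,1}_{(j,k)}\\ \rho\,\varphi^{0,1}_{(j,k)} & \sigma_1^2\,\varphi^1_{(j,k)}\end{pmatrix},$$

where for $i \in \{0,1\}$,

$$\varphi^i_{(j,k)} = \delta_{(j,2k+i)}\begin{pmatrix}1 & X_{(j,k)}\\ X_{(j,k)} & X_{(j,k)}^2\end{pmatrix},\qquad \varphi^{0,1}_{(j,k)} = \delta_{(j,2k)}\delta_{(j,2k+1)}\begin{pmatrix}1 & X_{(j,k)}\\ X_{(j,k)} & X_{(j,k)}^2\end{pmatrix}.$$

Lemma C.10 yields that the almost sure limit of the process $\langle M^{(n)}\rangle_{\nu_n} = |\mathbb{T}^*_n|^{-1}\sum_{k\in\mathbb{T}_n}\mathbb{E}_{\overline{\mathcal{E}}}[D_kD_k^t \mid \mathcal{G}^{\mathcal{O}}_{k-1}]$ is $\Gamma$.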
Therefore, the assumption A.1 of Theorem B.6 holds under $\mathbb{P}_{\overline{\mathcal{E}}}$. Thanks to assumptions (H.1) and (H.4-5) we can easily prove that for some $r > 2$, one has $\sup_{k\ge0}\mathbb{E}[\|D_k\|^r \mid \mathcal{G}^{\mathcal{O}}_{k-1}] < \infty$ a.s., which in turn implies the Lindeberg condition A.2. We can now conclude that under $\mathbb{P}_{\overline{\mathcal{E}}}$ one has

$$M^{(n)}_{\nu_n} \xrightarrow{\ \mathcal{L}\ } \mathcal{N}(0,\Gamma).$$

Finally Eq. (2) implies that $\sum_{k\in\mathbb{T}_{n-1}}D_k = \Sigma_{n-1}(\hat\theta_n - \theta)$. Therefore, the result is a direct consequence of Lemma C.10 together with Slutsky's Lemma. □

We now turn to the asymptotic normality of $\hat\sigma^2_{i,n}$ and $\hat\rho_n$. The direct application of the central limit theorem for martingales to $\hat\sigma^2_{i,n}$ and $\hat\rho_n$ is not obvious because of the $\hat\varepsilon_{(j,2k+i)}$. We proceed along the same lines as in the proof of the convergence of $\hat\sigma^2_{i,n}$, using the decomposition along the generations. However, this time we need a convergence rate for $\hat\theta_n$ in order to apply Lemma B.1.

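Theorem C.12 Under assumptions (H.0-6), one has

$$\mathbf{1}_{\{|\mathbb{G}^*_n|>0\}}\,\|\hat\theta_n - \theta\|^2 = O\!\left(\frac{\log|\mathbb{T}^*_{n-1}|}{|\mathbb{T}^*_{n-1}|}\right)\quad\text{a.s.}$$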



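Proof This result is based on the asymptotic behavior of the martingale $(M_n)$ defined as follows

$$M_n = \sum_{j=1}^{m}\sum_{k\in\mathbb{T}_{n-1}}\begin{pmatrix}\delta_{(j,2k)}\varepsilon_{(j,2k)}\\ \delta_{(j,2k)}X_{(j,k)}\varepsilon_{(j,2k)}\\ \delta_{(j,2k+1)}\varepsilon_{(j,2k+1)}\\ \delta_{(j,2k+1)}X_{(j,k)}\varepsilon_{(j,2k+1)}\end{pmatrix}.$$

For all $n \ge 2$, we readily deduce from the definitions of the BAR process and of our estimator $\hat\theta_n$ that

$$\hat\theta_n - \theta = \Sigma_{n-1}^{-1}\sum_{j=1}^{m}\sum_{k\in\mathbb{T}_{n-1}}\begin{pmatrix}\delta_{(j,2k)}\varepsilon_{(j,2k)}\\ \delta_{(j,2k)}X_{(j,k)}\varepsilon_{(j,2k)}\\ \delta_{(j,2k+1)}\varepsilon_{(j,2k+1)}\\ \delta_{(j,2k+1)}X_{(j,k)}\varepsilon_{(j,2k+1)}\end{pmatrix} = \Sigma_{n-1}^{-1}M_n.$$

The sharp asymptotic behavior of $(M_n)$ relies on properties of vector martingales. Thanks to Lemma B.4, the proof follows exactly the same lines as that of the first part of Theorem 3.2 of de Saporta et al. (2011) and is not repeated here. □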



We can now turn to the end of the proof of Theorem 4.2 concerning the asymptotic normality of $\hat\sigma^2_{i,n}$ and $\hat\rho_n$.

Proof of Theorem asymptotic normality ofaf^ Thanks to Eq. Q and (jsj), we decompose 
„ — erf into two parts J7,\ and 



'^i) - E E E ^j,2k+i) ^0',2fe+i) 
j = l £=0 keGt-i 

m n — 1 m n—1 



EE E %,2fe+i)(£(j,2fc+») - 



EE E -U.) + EE E -kk, = K + v:, 

j = l e=0 keGe-i 3 = 1 l=Q fcGGf-i 



with 



$$\delta_{(j,2k+i)}\big((a_i - \hat a_{i,\ell})^2 + (b_i - \hat b_{i,\ell})^2X^2_{(j,k)} + 2(a_i - \hat a_{i,\ell})(b_i - \hat b_{i,\ell})X_{(j,k)}\big)$$

We first deal with $U^i_n$ and study the limit of $\pi^{-n/2}U^i_n$. Let us just detail the first term



ri-l 



{t-n)/2_ 



i {ai -Qi^tf ( 1 



j = l 1=0 k£l 



On the one hand, Lemmas B.5, B.3 and B.2 imply that $\pi^{-\ell}\sum_{k\in\mathbb{G}_{\ell-1}}\delta_{(j,2k+i)}$ converges a.s. to a finite limit. On the other hand, thanks to Theorem C.12, one has $(a_i - \hat a_{i,\ell})^2(\ell\pi^{-\ell})^{-1} = O(1)$ a.s.

As a result, one obtains lim;_>.oo a^i/ — a.s. as tt > 1 by assumption. Therefore, Lemma |B . 1 1 yields 

rn 71 — 1 



lim ^j,2fe+j(aj-ai,^)^=0 a.s. 



The other terms in $U^i_n$ are dealt with similarly, using Lemma C.3 instead of Lemma B.5. One obtains $\lim_{n\to\infty}\pi^{-n/2}U^i_n = 0$ a.s. and as a result Lemma B.3 yields $\lim_{n\to\infty}|\mathbb{T}^*_{n-1}|^{-1/2}U^i_n = 0$.

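Let us now deal with the martingale terms $V^i_n$. Set $V_n = (V^0_n, V^1_n)^t$. Let us remark that $|\mathbb{T}^*_{n-1}|^{-1/2}V_n = M^{(n)}_{\nu_n}$ with $M^{(n)} = (M^{(n)}_p)_{p\ge1}$ the sequence of $(\mathcal{G}^{\mathcal{O}}_p)$-vector martingales defined by

$$M^{(n)}_p = |\mathbb{T}^*_{n-1}|^{-1/2}\sum_{k=1}^{p}\sum_{j=1}^{m}\big(v^0_{(j,k)},\ v^1_{(j,k)}\big)^t$$

and $\nu_n = 2^n - 1$. We now want to apply Theorem B.6 to $M^{(n)}$. Using Lemmas C.3 to C.9 together with Lemma B.1 and Theorem C.12 along the same lines as above, we obtain the following limit conditionally to $\overline{\mathcal{E}}$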



Hm < M = 



(i.2-^2^2)/,oi(o)^-i irt~af)z^ 



= 1 V 



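Therefore, assumption A.1 of Theorem B.6 holds under $\mathbb{P}_{\overline{\mathcal{E}}}$. Thanks to assumptions (H.1) and (H.4-5) we can prove that for some $r > 2$, $\sup_{k\ge0}\mathbb{E}_{\overline{\mathcal{E}}}[\|v_k\|^r \mid \mathcal{G}^{\mathcal{O}}_{k-1}] < \infty$ a.s., which implies the Lindeberg condition. Therefore, we obtain that under $\mathbb{P}_{\overline{\mathcal{E}}}$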



If one sets 



(r4 - a4)(z0)-l (^2 _ ^2^2);^01(o)(^,0,l)-l 



one obtains the expected result using Slutsky's lemma. 



(9) 
□ 



Proof of Theorem 4.2, asymptotic normality of $\hat\rho_n$. Along the same lines, we show the central limit theorem for $\hat\rho_n$. One has

m n— 1 

\^*n-l\iPn - P) = EE E (%",2fc)%,2fe+l) - «^(i.2fc)£0-,2fc + l)) 
m n— 1 ni n — 1 

= EE E %-,fe)+EE E ^[m^^n + v:,, 



j=i e=o k£i 



j = l i=0 feGGf_i 



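with

\begin{align*}
u'_{(j,k)} &= \delta_{(j,2k)}\delta_{(j,2k+1)}\Big((a_0 - \hat a_{0,\ell})(a_1 - \hat a_{1,\ell}) + (b_0 - \hat b_{0,\ell})(b_1 - \hat b_{1,\ell})X^2_{(j,k)}\\
&\qquad + \big((a_0 - \hat a_{0,\ell})(b_1 - \hat b_{1,\ell}) + (b_0 - \hat b_{0,\ell})(a_1 - \hat a_{1,\ell})\big)X_{(j,k)}\Big),\\
v'_{(j,k)} &= \delta_{(j,2k)}\delta_{(j,2k+1)}\Big(\big((a_0 - \hat a_{0,\ell}) + (b_0 - \hat b_{0,\ell})X_{(j,k)}\big)\varepsilon_{(j,2k+1)}\\
&\qquad + \big((a_1 - \hat a_{1,\ell}) + (b_1 - \hat b_{1,\ell})X_{(j,k)}\big)\varepsilon_{(j,2k)} + \varepsilon_{(j,2k)}\varepsilon_{(j,2k+1)} - \rho\Big).
\end{align*}
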
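Thanks to Theorem C.12, it is easy to check that $\lim_{n\to\infty}|\mathbb{T}^*_{n-1}|^{-1/2}U'_n = 0$ a.s. Let us define a new sequence of $(\mathcal{G}^{\mathcal{O}}_p)$-martingales $(M^{(n)}_p)$ by

$$M^{(n)}_p = |\mathbb{T}^*_{n-1}|^{-1/2}\sum_{k=1}^{p}\sum_{j=1}^{m}v'_{(j,k)}.$$

We clearly have $M^{(n)}_{\nu_n} = |\mathbb{T}^*_{n-1}|^{-1/2}V'_n$. We obtain the $\mathbb{P}_{\overline{\mathcal{E}}}$-a.s. limit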



lim \Tt\V^Y.^M\Q'^^i\-^ 

n^oc ^ — ^ 



So we have assumption A.1 of Theorem B.6. We also derive the Lindeberg condition A.2. Consequently, we obtain that under $\mathbb{P}_{\overline{\mathcal{E}}}$, one has



Setting 

completes the proof of Theorem 4.2.



2 2 



(10) 

□ 



C.4 Interval estimation and tests for the BAR process 

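For all $n \ge 1$, define the $4\times4$ matrices $\Gamma_n$ and $\Omega_n$ by

$$\Gamma_n = |\mathbb{T}^*_n|^{-1}\begin{pmatrix}\hat\sigma^2_{0,n}S^0_n & \hat\rho_nS^{0,1}_n\\ \hat\rho_nS^{0,1}_n & \hat\sigma^2_{1,n}S^1_n\end{pmatrix}$$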



and 



Note that the matrix $\Gamma_n$ is the empirical estimator of matrix $\Gamma$ while $\Omega_n$ is the empirical estimator of the asymptotic variance of $\hat\theta_n - \theta$.



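Theorem C.13 Under assumptions (H.0-6), for any $0 < \epsilon < 1$, the intervals

\begin{align*}
&\big[\hat a_{0,n} - q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{1,1}\ ;\ \hat a_{0,n} + q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{1,1}\big],\qquad \big[\hat b_{0,n} - q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{2,2}\ ;\ \hat b_{0,n} + q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{2,2}\big],\\
&\big[\hat a_{1,n} - q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{3,3}\ ;\ \hat a_{1,n} + q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{3,3}\big],\qquad \big[\hat b_{1,n} - q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{4,4}\ ;\ \hat b_{1,n} + q_{1-\epsilon/2}(\Omega_{n-1})^{1/2}_{4,4}\big]
\end{align*}

are asymptotic confidence intervals with level $1 - \epsilon$ of the parameters $a_0$, $b_0$, $a_1$ and $b_1$ respectively.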



Proof This is a straightforward consequence of the central limit Theorem 4.2 together with Slutsky's lemma as $\lim_{n\to\infty}|\mathbb{T}^*_{n-1}|\,\Omega_{n-1} = \Sigma^{-1}\Gamma\Sigma^{-1}$ $\mathbb{P}_{\overline{\mathcal{E}}}$-a.s. thanks to Lemma C.10 and Theorem 4.1. □
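
As an illustration only, the intervals of Theorem C.13 can be sketched in a few lines of Python. The sketch assumes that the estimates and the diagonal entries of the variance estimate $\Omega_{n-1}$ have already been computed; the function name `confidence_intervals`, its arguments `theta_hat` and `omega_diag`, and the numbers in the usage loop are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist


def confidence_intervals(theta_hat, omega_diag, eps=0.05):
    """Asymptotic confidence intervals of level 1 - eps.

    theta_hat  -- estimates of (a0, b0, a1, b1) (hypothetical input)
    omega_diag -- diagonal entries of the estimated variance matrix of
                  theta_hat, assumed to be computed beforehand
    """
    # Quantile of order 1 - eps/2 of the standard normal distribution.
    q = NormalDist().inv_cdf(1.0 - eps / 2.0)
    intervals = []
    for estimate, variance in zip(theta_hat, omega_diag):
        half_width = q * math.sqrt(variance)
        intervals.append((estimate - half_width, estimate + half_width))
    return intervals


# Made-up numbers, for illustration only.
for low, high in confidence_intervals([0.04, 0.45, 0.03, 0.47],
                                      [1e-4, 4e-4, 1e-4, 4e-4]):
    print(f"[{low:.4f}, {high:.4f}]")
```

Each interval is centred at the estimate, and its half-width is the normal quantile times the square root of the corresponding diagonal variance entry.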



Let 

7n m 

/J°1(0)=p<o)(1,1)|T:J-i^ E %,2fc)+i5li)(l,l)|T:j-i5] ^Uak+D 
i=ifeeT„_i j=ifeGT„_i 

be an empirical estimator of /i*'^(0) and 

be an empirical estimator of the variance term in the central limit theorem regarding (t|. 



4.1 

Theorem C.14 Under assumptions (H.0-6), for any $0 < \epsilon < 1$, the intervals
r /T \ 1/2 /T \ i/2n 

are asymptotic confidence intervals with level $1 - \epsilon$ of the parameters $\sigma_i^2$ and $\rho$ respectively.



Proof This is again a straightforward consequence of the central limit Theorem 4.2 together with
Slutsky's lemma as



lim r„ 



T ^2 ^2 2 2 



almost surely thanks to Lemma B.5 and Theorems 4.1 and C.11. □



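We now propose two different symmetry tests for the BAR process based on the central limit Theorem 4.2. The first one compares the couples $(a_0, b_0)$ and $(a_1, b_1)$. Set

• $H^c_0$: $(a_0, b_0) = (a_1, b_1)$ the symmetry hypothesis,

• $H^c_1$: $(a_0, b_0) \neq (a_1, b_1)$ the alternative hypothesis.

Let $(Y^c_n)^tY^c_n$ be the test statistic defined by

$$Y^c_n = |\mathbb{T}^*_{n-1}|^{1/2}(\Delta^c_n)^{-1/2}\big(\hat a_{0,n} - \hat a_{1,n},\ \hat b_{0,n} - \hat b_{1,n}\big)^t,$$

where

$$\Delta^c_n = |\mathbb{T}^*_{n-1}|\,dg_c^t\,\Omega_{n-1}\,dg_c,\qquad dg_c = \begin{pmatrix}1 & 0 & -1 & 0\\ 0 & 1 & 0 & -1\end{pmatrix}^t.$$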

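Theorem C.15 Under assumptions (H.0-6) and the null hypothesis $H^c_0$, one has

$$\|Y^c_n\|^2 \xrightarrow{\ \mathcal{L}\ } \chi^2(2)$$

on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H^c_1$, almost surely on $\overline{\mathcal{E}}$ one has

$$\lim_{n\to\infty}\|Y^c_n\|^2 = +\infty.$$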



Proof We mimic again the proof of Theorem B.9 with $g_c$ the function defined from $\mathbb{R}^4$ onto $\mathbb{R}^2$ by $g_c(x_1, x_2, x_3, x_4) = (x_1 - x_3, x_2 - x_4)^t$, so that $dg_c$ is the gradient of $g_c$. □
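
As an illustration only, the first symmetry test can be sketched in Python. The sketch assumes that the statistic has the usual Wald quadratic form, so that the normalizing factor $|\mathbb{T}^*_{n-1}|$ cancels and the statistic equals $d^t(dg_c^t\,\Omega\,dg_c)^{-1}d$ with $d = g_c(\hat\theta_n)$ and $\Omega$ the estimated variance matrix of $\hat\theta_n$. The function name `symmetry_test_couples` and its arguments `theta_hat` and `omega` are hypothetical; both inputs are assumed to be computed beforehand.

```python
import math


def symmetry_test_couples(theta_hat, omega):
    """Wald-type test of (a0, b0) = (a1, b1).

    theta_hat -- estimates (a0, b0, a1, b1) (hypothetical input)
    omega     -- 4x4 estimated variance matrix of theta_hat (nested lists)
    Returns the statistic and its asymptotic chi-square(2) p-value.
    """
    # Columns of the gradient of g_c(x1, x2, x3, x4) = (x1 - x3, x2 - x4).
    c1 = (1.0, 0.0, -1.0, 0.0)
    c2 = (0.0, 1.0, 0.0, -1.0)

    def quad(u, v):
        # Bilinear form u^t * omega * v.
        return sum(u[r] * omega[r][s] * v[s]
                   for r in range(4) for s in range(4))

    d1 = theta_hat[0] - theta_hat[2]
    d2 = theta_hat[1] - theta_hat[3]
    v11, v12 = quad(c1, c1), quad(c1, c2)
    v21, v22 = quad(c2, c1), quad(c2, c2)
    det = v11 * v22 - v12 * v21
    # d^t V^{-1} d, using the closed-form inverse of the 2x2 matrix V.
    stat = (d1 * d1 * v22 - d1 * d2 * (v12 + v21) + d2 * d2 * v11) / det
    # Survival function of the chi-square law with 2 degrees of freedom.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value
```

The symmetry hypothesis is rejected at level $\epsilon$ when the returned p-value is below $\epsilon$. The closed form $e^{-x/2}$ is specific to two degrees of freedom.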

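Our next test compares the fixed points $a_0/(1 - b_0)$ and $a_1/(1 - b_1)$, which are the asymptotic means of $X_{(j,2k)}$ and $X_{(j,2k+1)}$ respectively. Set

• $H^f_0$: $a_0/(1 - b_0) = a_1/(1 - b_1)$ the symmetry hypothesis,

• $H^f_1$: $a_0/(1 - b_0) \neq a_1/(1 - b_1)$ the alternative hypothesis.

Let $(Y^f_n)^2$ be the test statistic defined by

$$Y^f_n = |\mathbb{T}^*_{n-1}|^{1/2}(\Delta^f_n)^{-1/2}\left(\frac{\hat a_{0,n}}{1 - \hat b_{0,n}} - \frac{\hat a_{1,n}}{1 - \hat b_{1,n}}\right),$$

where $\Delta^f_n = |\mathbb{T}^*_{n-1}|\,dg_f^t\,\Omega_{n-1}\,dg_f$, and $dg_f = \big(1/(1 - \hat b_{0,n}),\ \hat a_{0,n}/(1 - \hat b_{0,n})^2,\ -1/(1 - \hat b_{1,n}),\ -\hat a_{1,n}/(1 - \hat b_{1,n})^2\big)^t$. This test statistic has the following asymptotic properties.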

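Theorem C.16 Under assumptions (H.0-6) and the null hypothesis $H^f_0$, one has

$$(Y^f_n)^2 \xrightarrow{\ \mathcal{L}\ } \chi^2(1)$$

on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H^f_1$, almost surely on $\overline{\mathcal{E}}$ one has

$$\lim_{n\to\infty}(Y^f_n)^2 = +\infty.$$
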
Proof We mimic again the proof of Theorem B.9 with $g_f$ the function defined from $\mathbb{R}^4$ onto $\mathbb{R}$ by $g_f(x_1, x_2, x_3, x_4) = x_1/(1 - x_2) - x_3/(1 - x_4)$, so that $dg_f$ is the gradient of $g_f$. □
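
As an illustration only, the fixed-point test can be sketched in Python by the delta method. The sketch assumes the usual Wald form, in which the statistic is the squared difference of the estimated fixed points divided by its estimated variance $dg_f^t\,\Omega\,dg_f$, with $\Omega$ the estimated variance matrix of $\hat\theta_n$. The function name `symmetry_test_fixed_points` and its arguments are hypothetical; the inputs are assumed to be computed beforehand.

```python
import math


def symmetry_test_fixed_points(theta_hat, omega):
    """Wald-type test of a0/(1 - b0) = a1/(1 - b1).

    theta_hat -- estimates (a0, b0, a1, b1) (hypothetical input)
    omega     -- 4x4 estimated variance matrix of theta_hat (nested lists)
    Returns the statistic and its asymptotic chi-square(1) p-value.
    """
    a0, b0, a1, b1 = theta_hat
    # Difference between the two estimated fixed points.
    diff = a0 / (1.0 - b0) - a1 / (1.0 - b1)
    # Gradient of g_f(x1, x2, x3, x4) = x1/(1 - x2) - x3/(1 - x4).
    grad = (1.0 / (1.0 - b0), a0 / (1.0 - b0) ** 2,
            -1.0 / (1.0 - b1), -a1 / (1.0 - b1) ** 2)
    # Delta-method variance: grad^t * omega * grad.
    variance = sum(grad[r] * omega[r][s] * grad[s]
                   for r in range(4) for s in range(4))
    stat = diff * diff / variance
    # Survival function of the chi-square law with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

The p-value uses the identity $P(\chi^2(1) > x) = \operatorname{erfc}(\sqrt{x/2})$, which avoids any dependency outside the standard library.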

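Finally, our last test compares the even and odd variances $\sigma_0^2$ and $\sigma_1^2$ of the noise sequence. Set

• $H^\sigma_0$: $\sigma_0^2 = \sigma_1^2$ the symmetry hypothesis,

• $H^\sigma_1$: $\sigma_0^2 \neq \sigma_1^2$ the alternative hypothesis.

Let $(Y^\sigma_n)^2$ be the test statistic defined by

$$Y^\sigma_n = |\mathbb{T}^*_{n-1}|^{1/2}(\Delta^\sigma_n)^{-1/2}\big(\hat\sigma^2_{0,n} - \hat\sigma^2_{1,n}\big),$$

where $\Delta^\sigma_n = |\mathbb{T}^*_{n-1}|\,dg_\sigma^t\,\Gamma^\sigma_{n-1}\,dg_\sigma$, and $dg_\sigma = (1, -1)^t$. This test statistic has the following asymptotic properties.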

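Theorem C.17 Under assumptions (H.0-6) and the null hypothesis $H^\sigma_0$, one has

$$(Y^\sigma_n)^2 \xrightarrow{\ \mathcal{L}\ } \chi^2(1)$$

on $(\overline{\mathcal{E}}, \mathbb{P}_{\overline{\mathcal{E}}})$; and under the alternative hypothesis $H^\sigma_1$, almost surely on $\overline{\mathcal{E}}$ one has

$$\lim_{n\to\infty}(Y^\sigma_n)^2 = +\infty.$$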



Proof We mimic one last time the proof of Theorem B.9 with $g_\sigma$ the function defined from $\mathbb{R}^2$ onto $\mathbb{R}$ by $g_\sigma(x_1, x_2) = x_1 - x_2$, so that $dg_\sigma$ is the gradient of $g_\sigma$. □